UNITEXT 139 


Alberto Rotondi - Paolo Pedroni 
Antonio Pievatolo 


Probability, 
Statistics 
and Simulation 


With Application Programs 
Written in R 


Springer


UNITEXT 
La Matematica per il 3+2 


Volume 139 


Editor-in-Chief 


Alfio Quarteroni, Politecnico di Milano, Milan, Italy; École Polytechnique Fédérale
de Lausanne (EPFL), Lausanne, Switzerland


Series Editors

Luigi Ambrosio, Scuola Normale Superiore, Pisa, Italy

Paolo Biscari, Politecnico di Milano, Milan, Italy

Ciro Ciliberto, Università di Roma “Tor Vergata”, Rome, Italy

Camillo De Lellis, Institute for Advanced Study, Princeton, New Jersey, USA

Massimiliano Gubinelli, Hausdorff Center for Mathematics, Rheinische
Friedrich-Wilhelms-Universität, Bonn, Germany

Victor Panaretos, Institute of Mathematics, École Polytechnique Fédérale de
Lausanne (EPFL), Lausanne, Switzerland

Lorenzo Rosasco, DIBRIS, Università degli Studi di Genova, Genova, Italy;
Center for Brains Mind and Machines, Massachusetts Institute of Technology,
Cambridge, Massachusetts, USA; Istituto Italiano di Tecnologia, Genova, Italy


The UNITEXT - La Matematica per il 3+2 series is designed for undergraduate
and graduate academic courses, and also includes advanced textbooks at a research
level.

Originally released in Italian, the series now publishes textbooks in English 
addressed to students in mathematics worldwide. 

Some of the most successful books in the series have evolved through several 
editions, adapting to the evolution of teaching curricula. 

Submissions must include at least 3 sample chapters, a table of contents, and 
a preface outlining the aims and scope of the book, how the book fits in with the 
current literature, and which courses the book is suitable for. 

For any further information, please contact the Editor at Springer:
francesca.bonadei@springer.com
THE SERIES IS INDEXED IN SCOPUS 






Alberto Rotondi
Dipartimento di Fisica
Università di Pavia
Pavia, Italy

Paolo Pedroni
Istituto Nazionale di Fisica Nucleare
Università di Pavia
Pavia, Italy

Antonio Pievatolo
Istituto di Matematica Applicata e
Tecnologie Informatiche
Consiglio Nazionale delle Ricerche
Milano, Italy


ISSN 2038-5714 ISSN 2532-3318 (electronic) 
UNITEXT 

ISSN 2038-5722 ISSN 2038-5757 (electronic) 

La Matematica per il 3+2

ISBN 978-3-031-09428-6 ISBN 978-3-031-09429-3 (eBook) 


https://doi.org/10.1007/978-3-031-09429-3


© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland 
AG 2022 

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether 
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse 
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and 
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar 
or dissimilar methodology now known or hereafter developed. 

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication 
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant 
protective laws and regulations and therefore free for general use. 

The publisher, the authors, and the editors are safe to assume that the advice and information in this book 
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or 
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any 
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional 
claims in published maps and institutional affiliations. 


Cover illustration: “The face number three and two numbers three, one chance over 1326” (photo by the 
authors) 


This Springer imprint is published by the registered company Springer Nature Switzerland AG 
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland 


Preface 


This book, based on the fourth Italian edition, comes from the collaboration between 
two experimental physicists and one statistician. Among non-statisticians, physicists 
are perhaps the ones who most appreciate and use probability and statistics, but most 
of the time in a pragmatic and manual way, having in mind the solution of specific 
problems or technical applications. On the other hand, in the crucial comparison 
between theory and experiment, it is sometimes necessary to use sophisticated 
methods which require knowledge of the fundamental logical and mathematical 
principles underlying the study of random phenomena. More generally, even
those who are not statisticians often have to face, in any research field, problems
that require particular attention and expertise in the treatment of random or aleatory
aspects. These skills are naturally mastered by the statistician, whose research 
interests are the laws of chance. 

This text has been prepared with the aim of seeking a synthesis between these
different approaches, providing the reader not only with tools useful for addressing
problems, but also with a guide to the correct methodologies needed to understand
the complicated and fascinating world of random phenomena.

Such an objective obviously involved choices, sometimes even painful ones, both of
content and of style. As for style, we tried not to give up the precision needed
to teach the important concepts properly. When treating applications, we favoured
the methods that do not require excessive preliminary conceptual elaboration.

As an example, we have tried to use, whenever possible, approximate methods
for interval estimation, with Gaussian approximations for the estimator distributions.
Similarly, in the case of least squares, we have extensively adopted the
approximation based on the $\chi^2$ distribution to verify the fitting of a model to the
data.

We also avoided insisting on the formal treatment of complicated problems in
cases where a solution could easily be found using a computer and simple
simulation programs.

In our book, simulation plays an important role in the presentation of many topics
and in the verification of the accuracy of many techniques and approximations. This
feature, already present in the first Italian edition, is now common to many data
science texts and, in our opinion, confirms the validity of our initial choice.

This book is aimed primarily at students of scientific undergraduate courses, such 
as engineering, computer science, and physics. However, we think that it can also 
be useful to all those scientific researchers who have to solve practical problems 
involving probabilistic, statistical, and simulation aspects. For this reason, we have 
given space to some topics, such as Monte Carlo methods and their applications, 
minimization techniques, and data analysis methods, which, usually, are only briefly 
mentioned in introductory texts. 

The mathematical background required of the reader is that normally
provided by a basic calculus course in a scientific degree programme, with the
addition of a few notions of linear algebra and advanced calculus, such as the
elementary concepts of differentiation and integration of functions of several variables.

The structure of the text allows different learning paths and reading levels. The
first seven chapters deal with all the topics usually developed in a standard, basic
statistics course. At the teacher's discretion, this programme can be supplemented with
some more advanced topics from the other chapters. For example, Chap. 8 should
certainly be included in a simulation-oriented course.

The notions of probability and statistics usually taught to physics students in
undergraduate laboratory courses are covered in the first three chapters, in Chaps. 6
and 7 (basic statistics) and in Chap. 12, written explicitly for physicists and for all
those who need to process data from laboratory experiments.

Many pages are devoted to the complete solution of several exercises placed
directly within the chapters to better explain the topics covered. We also recommend
to the reader the problems (all with solutions) given at the end of each chapter.

This book makes use of the statistical software R, which has now become the
world standard for solving statistical problems. The 2019 ranking of the Institute
of Electrical and Electronics Engineers (IEEE) places R in fourth position
among the most popular programming languages, after Python, Java, and C. We have
written many R routines to guide the reader through the text.
These routines can be easily downloaded from the link specified below. We therefore
recommend an interactive reading, in which the study of a topic is followed by the
use of R routines in the way shown both in the text and in the technical instructions
included in the indicated Web pages.

We thank again the readers who reported errors or inaccuracies present in the 
previous Italian editions, and the publisher, Springer, for the continued trust in our 
work. 


Pavia, Italy Alberto Rotondi 
Pavia, Italy Paolo Pedroni 
Milano, Italy Antonio Pievatolo 


March 2022 


How to Use the Text 


Figures, equations, definitions, theorems, tables, and exercises are numbered
progressively.

The abbreviations of quotations (e.g., [57]) refer to the bibliographic list at the 
end of the book. 

Solutions of the problems are given in Appendix D. The table of symbols 
reported in Appendix A may also be useful. 

Code names such as hist are marked with a different text style. Routines
starting with a lowercase letter are (with some exceptions) original R routines,
which can be freely downloaded from the CRAN (Comprehensive R Archive Network)
website, while those starting with an uppercase letter are written by the authors and
are available at:


https://tinyurl.com/ProbStatSimul 


On this site, you will also find all the information needed to install and use
R, a guide to the routines written by the authors, and complementary teaching
materials.


Contents 


1 Probability
1.1 Chance, Chaos and Determinism
1.2 [...]
1.3 The Concept of Probability
1.4 Axiomatic Probability
1.5 Repeated Trials
1.6 Elements of Combinatorial Analysis
1.7 Bayes' Theorem
1.8 Learning Algorithms
1.9 Problems
2 Representation of Random Phenomena
2.1 Introduction
2.2 Random Variables
2.3 Cumulative or Distribution Function
2.4 Data Representation
2.5 Discrete Random Variables
2.6 Binomial Distribution
2.7 Continuous Random Variables
2.8 Mean, Sum of Squares, Variance, Standard Deviation and Quantiles
2.9 Operators
2.10 Simple Random Sample
2.11 Convergence Criteria
2.12 Problems
3 Basic Probability Theory
3.1 Introduction
3.2 Properties of the Binomial Distribution
3.3 Poisson Distribution
3.4 Normal or Gaussian Density
3.5 The Three-Sigma Law and the Standard Gaussian Density
3.6 Central Limit Theorem and Universality of the Gaussian Density
3.7 Poisson Stochastic Processes
3.8 [...]
3.9 Uniform Density
3.10 Chebyshev's Inequality
3.11 How to Use Probability Calculus
3.12 Problems
4 Multivariate Probability Theory
4.1 Introduction
4.2 Multivariate Statistical Distributions
4.3 Covariance and Correlation
4.4 Two-Dimensional Gaussian Distribution
4.5 The General Multidimensional Case
4.6 Multivariate Probability Regions
4.7 Multinomial Distribution
4.8 Problems
5 Functions of Random Variables
5.1 Introduction
5.2 Functions of a Random Variable
5.3 Functions of Several Random Variables
5.4 Mean and Variance Transformation
5.5 Means and Variances for m Variables
5.6 Problems
6 Basic Statistics: Parameter Estimation
6.1 Introduction
6.2 Confidence Intervals
6.3 Confidence Intervals with Pivotal Variables
6.4 Mention of the Bayesian Approach
6.5 Some Notations
6.6 Probability Estimation
6.7 Probability Estimation from Large Samples
6.8 Poissonian Interval Estimation
6.9 Mean Estimation from Large Samples
6.10 Variance Estimation from Large Samples
6.11 Mean and Variance Estimation for Gaussian Samples
6.12 How to Use the Estimation Theory
6.13 Estimates from a Finite Population
6.14 Histogram Analysis
6.15 Estimation of the Correlation
6.16 Problems


7 Basic Statistics: Hypothesis Testing
7.1 Testing One Hypothesis
7.2 The Gaussian z-Test
7.3 [...]
7.4 [...]
7.5 Compatibility Check Between Sample and Population
7.6 Hypothesis Testing with Contingency Tables
7.7 Multiple Tests
7.8 Snedecor's F-Test
7.9 Analysis of Variance (ANOVA)
7.10 Two-Way ANOVA
7.11 Problems
8 Monte Carlo Methods
8.1 Introduction
8.2 [...]
8.3 Mathematical Aspects
8.4 Generation of Discrete Random Variables
8.5 Generation of Continuous Random Variables
8.6 Linear Search Method
8.7 Rejection Method
8.8 Particular Random Generation Methods
8.9 Monte Carlo Analysis of Distributions
8.10 Evaluation of Confidence Intervals
8.11 Simulation of Counting Experiments
8.12 Non-parametric Bootstrap
8.13 Hypothesis Test with Simulated Data
8.14 Problems
9 Applications of Monte Carlo Methods
9.1 Introduction
9.2 Study of Diffusion Phenomena
9.3 Simulation of Stochastic Processes
9.4 Number of Workers in a Plant: Synchronous Simulation
9.5 Number of Workers in a Plant: Asynchronous Simulation
9.6 Kolmogorov-Smirnov Test
9.7 Metropolis Algorithm
9.8 [...]
9.9 Definite Integral Calculation
9.10 Importance Sampling
9.11 Stratified Sampling
9.12 Multidimensional Integrals
9.13 Problems


10 Statistical Inference and Likelihood
10.1 Introduction
10.2 Maximum Likelihood (ML) Method
10.3 Estimator Properties
10.4 Theorems on Estimators
10.5 Confidence Intervals
10.6 Least Squares Method and Maximum Likelihood
10.7 Best Fit of Densities to Data and Histograms
10.8 Weighted Mean
10.9 Test of Hypotheses
10.10 One- or Two-Sample Tests
10.11 Most Powerful Tests
10.12 [...]
10.13 Sequential Tests
10.14 Problems
11 Least Squares
11.1 Introduction
11.2 No Errors on Predictors
11.3 Errors in Predictors
11.4 Least Squares Regression Lines: Unweighted Case
11.5 Unweighted Linear Least Squares
11.6 Weighted Linear Least Squares
11.7 Properties of Least Squares Estimates
11.8 Model Testing and Search for Functional Forms
11.9 Search for Correlations
11.10 [...]
11.11 Nonlinear Least Squares
11.12 Problems
12 Experimental Data Analysis
12.1 Introduction
12.2 [...]
12.3 Constant and Variable Physical Quantities
12.4 Instrumental Sensitivity and Accuracy
12.5 Measurement Uncertainty
12.6 Treatment of Systematic Effects
12.7 Best Fit with Offset Systematic Errors
12.8 Best Fit with Scale Systematic Errors
12.9 Indirect Measurements and Error Propagation
12.10 Measurement Types
12.11 $M(0, 0, \Delta)$ Measurements
12.12 $M(0, \sigma, 0)$ Measurements
12.13 $M(0, \sigma, \Delta)$ Measurements
12.14 $M(f, 0, 0)$ Measurements
12.15 $M(f, \sigma, 0)$, $M(f, 0, \Delta)$ and $M(f, \sigma, \Delta)$ Measurements
12.16 A Case Study: Millikan's Experiments
12.17 Some Remarks on the Scientific Method
12.18 Problems

A Table of Symbols
B R Software
C Moment-Generating Functions
D Solutions of Problems
E Tables
E.1 Integral of the Gaussian Density
E.2 Quantiles of the Student's Density
E.3 Integrals of the Reduced $\chi^2$ Density
E.4 Quantile Values of the Non-Reduced $\chi^2$ Density
E.5 Quantiles of the F Density
Bibliography
Index


About the Authors 


Alberto Rotondi was formerly Full Professor of Nuclear Physics at the University
of Pavia (Italy), where he is now Adjunct Professor of Data Analysis in the
Physics Department. He is the author of several hundred publications in the field of
experimental nuclear physics. During his research activity, carried out mainly at the
CERN laboratories in Geneva, he applied many statistical and simulation methods
to the design of new experiments and to the analysis of experimental results.


Paolo Pedroni is a senior researcher at INFN (Italian National Institute of Nuclear 
Physics), Section of Pavia, and an adjunct professor in the Physics Department at 
the University of Pavia (Italy), where he teaches the course Statistical Methods 
in Physics. He has been carrying out experimental research on the interactions 
of quarks within protons and neutrons in the framework of several international 
collaborations. 


Antonio Pievatolo is director of research at CNR IMATI in Milan, Italy, where
he works on applications of statistics in industry and technology. He has led his
local unit in applied research projects commissioned by national and regional public
bodies and companies. He was president of the European Network for Business
and Industrial Statistics (ENBIS) from 2017 to 2019, and Contract Professor of
Statistics at the University.




Chapter 1
Probability


There seems to be no alternative to accepting some sort of 
incomprehensible quality to existence. Take your pick. We all 
fluctuate delicately between subjective view and objective view 
of the world, and this quandary is central to human nature. 


Douglas R. Hofstadter, “THE MIND’S I”. 


1.1 Chance, Chaos and Determinism


In this introduction, before looking into the phenomena known as aleatory, stochastic
or random, we will briefly analyse the importance and the role of these physical
processes in our reality.

At the beginning of a scientific measurement or observation of a natural 
phenomenon, one usually tries to identify all the causes, conditions and external 
factors that determine its evolution. Subsequently, one operates in order to keep 
these external causes fixed or, as much as possible, under control, and then one 
proceeds to record the results of the observations. 

When repeating the observations, two situations can occur: 


• One always gets the same result. As an example, think of the measurement of a
table with a commercial tape measure.

• One gets a different result each time. Think of a very simple natural phenomenon:
the toss of a coin.


While, for the moment, in the first case there is not much to say, in the second case, 
we could ask ourselves what causes the observed variations of the results. Possible 
reasons are not having checked all the conditions that influence the phenomenon or 
having incorrectly defined the quantity to be observed. Once these corrections have 
been applied, the phenomenon can become stable or continue to show fluctuations. 

Let’s explain with an example: suppose we want to measure the time of sunrise
on the horizon at a given location. We will observe that repeated measurements
on successive days give different results. Obviously, in this case the variation of
the results is due to a poor definition of the measurement. The time of sunrise must be
measured in a certain place and on a certain day of the year, and the measurement
must be repeated one year later on the same day and in the same place. Redefining
the observation in this way, the results of repeated measurements coincide. As is
well known, the laws of planetary motion provide in this case a model that allows
us to predict (apart from small corrections that we do not want to discuss) the time
of sunrise for every day of the year.¹

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
A. Rotondi et al., Probability, Statistics and Simulation, La Matematica per il 3+2
139, https://doi.org/10.1007/978-3-031-09429-3_1

Let us now consider the measurement of another quantity, the temperature at
a certain time of day. In this case, even considering a certain day of the year
and repeating the measurements from year to year, different results are observed.
Unlike the time of sunrise, we do not possess a model that allows us to
predict the result accurately. Why does the temperature, unlike sunrise, seem to have
an inherently random behaviour? The reason is that, while the time of sunrise depends
on few-body interactions (the sun and the planets), the temperature depends not
only on astronomical conditions but also on the state of the atmosphere, which is
the result of the interaction of countless factors that cannot, even in principle, be
determined with absolute precision or, in any case, kept under control.

This distinction is crucial and is the key to establishing the difference between
quantities that fluctuate and those that appear to be fixed or are accurately predictable
on the basis of deterministic models.

Historically, deterministic systems were long considered free
of fluctuations, and their study, in the context of classical physics, proceeded in
parallel with that of the systems called stochastic, aleatory or random, which was
born with the study of gambling: dice throws, card games, roulette, slot machines,
lotto games and so on. The latter systems are specifically designed and built to ensure
the randomness of the results. There were therefore two separate domains of physics:
that of deterministic phenomena, without fluctuations, governed by the fundamental
laws of classical physics and usually consisting of simple systems (generally few-body
systems), and that of random phenomena, subject to fluctuations and often
related to complex systems (usually consisting of many bodies).

¹ Actually we do not know exactly how stable the solar system is. Some models indicate that
forecasts cannot be extended beyond a time interval of the order of one hundred million years
[AANt07].

However, already at the beginning of the last century, the French mathematician and
physicist H. Poincaré noticed that, in some cases, knowledge of the deterministic
laws was not enough to make exact predictions of the dynamics of some systems
starting from known initial conditions. The problem, which today is called the study
of chaos, was thoroughly investigated only much later, starting from the 1970s,
with the help of computers. Today, we know that, in macroscopic systems,
the origin of the fluctuations can be twofold: deterministic laws that are
highly sensitive to the initial conditions (chaotic systems) and
the impossibility of defining in a deterministic way all the variables of the system
(stochastic systems). One of the best paradigms for explaining chaos is the logistic
map, proposed in 1838 by the Belgian mathematician P.F. Verhulst and studied
in detail by the biologist R. May in 1976 and by the physicist M. Feigenbaum in
1978:


$$x(k+1) = \lambda\, x(k)\,[1 - x(k)], \qquad (1.1)$$


where $k$ is the population growth cycle, $\lambda$ is related to the growth rate and
$0 < x(k) < 1$ is a state variable proportional to the number of individuals in the
population. The condition $0 < \lambda < 4$ ensures that $x$ remains within the fixed limits.
The logistic law describes well the dynamics of evolution of populations where there
is an increase per cycle proportional to $\lambda\, x(k)$ with a negative feedback
$-\lambda\, x^2(k)$ proportional to the square of the size already reached by the population.

Without going too far into the study of the logistic map, we notice that the 
behaviour of the population evolves with the number of cycles according to the 
following characteristics (also shown in Fig. 1.1): 


• When $\lambda \le 1$, the model always leads to the extinction of the population.

• When $1 < \lambda \le 3$, the system reaches a stable level, which depends on $\lambda$ but is
independent of the initial condition $x(0)$.

• When $3 < \lambda \le 3.56994\ldots$, the system oscillates between some fixed values. For
example, as shown in Fig. 1.1 for $\lambda = 3.5$, there are four possible values (in the
figure a continuous line joins the discrete x values). Also in this case the states
reached by the system do not depend on the initial condition.

• When $\lambda > 3.56994\ldots$, the system is chaotic: the fluctuations seem regular, but,
as can be seen by looking carefully at Fig. 1.1 for $\lambda = 3.8$, they are neither
periodic nor do they seem entirely random. A thorough study also shows that
the fluctuations are not even precisely predictable, because initial
values $x(0)$ very close to each other lead to completely different evolutions.
This phenomenon, which is called sensitive dependence on initial conditions
or butterfly effect,² is one of the main characteristics of chaos. Note that the
fluctuations in chaotic systems are objective, intrinsic or essential, since the
reproducibility of the results would require initial conditions at an accuracy
level comparable to that of the atomic scale, which is not possible, not even in
principle.


Fig. 1.1 x values from logistic equation (1.1) having as initial starting value x = 0.3 for different
$\lambda$ values (four panels: $\lambda = 0.8$, $\lambda = 2.5$, $\lambda = 3.5$ and $\lambda = 3.8$; cycles 0 to 50 on the abscissas)


You can gain numerical experience with the logistic map and check the butterfly
effect using our R routines Logist and LogiPlot,³ with which we produced
Fig. 1.1.
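A numerical experiment of this kind can be sketched in a few lines of R. The code below is not the Logist or LogiPlot routine of the book: the function name logistic_map and its arguments are assumptions introduced here only for illustration. It iterates Eq. (1.1) and compares two chaotic trajectories whose initial values differ by one part in a million.

```r
# Iterate the logistic map x(k+1) = lambda * x(k) * (1 - x(k)).
# logistic_map is a hypothetical helper, not the book's Logist routine.
logistic_map <- function(lambda, x0, n_cycles = 50) {
  x <- numeric(n_cycles + 1)
  x[1] <- x0                          # x[1] stores x(0)
  for (k in seq_len(n_cycles)) {
    x[k + 1] <- lambda * x[k] * (1 - x[k])
  }
  x
}

# Chaotic regime: two initial values differing by 1e-6
xa <- logistic_map(3.8, 0.3)
xb <- logistic_map(3.8, 0.3 + 1e-6)
max_gap <- max(abs(xa - xb))          # largest separation over 50 cycles

# Stable regime: the level reached does not depend on x(0)
ya <- logistic_map(2.5, 0.3, n_cycles = 200)
yb <- logistic_map(2.5, 0.7, n_cycles = 200)
c(ya[201], yb[201])                   # both settle at 1 - 1/lambda = 0.6
```

Under these assumptions, max_gap becomes comparable to the size of x itself for lambda = 3.8, while for lambda = 2.5 the two runs converge to the same level. A call such as plot(0:50, xa, type = "l") followed by lines(0:50, xb, lty = 2) displays the two chaotic trajectories drifting apart.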

The methods for distinguishing chaotic systems from stochastic ones
are based essentially on the study of dispersions, that is, the differences between
the values of the same state variable in subsequent evolutions of the system, as a
function of the total number of state variables.

In a chaotic system, once a certain number of variables have been identified, 
the deviations stabilize or tend to decrease. This behaviour indicates that a number 
of state variables adequate to describe the system have been reached and that the 
deterministic law that regulates its dynamics can be obtained. The fluctuations in 
the results of repeated experiments in this case are attributed, as we have seen, to 
the sensitivity of the system with respect to the initial conditions. 

In a stochastic system, conversely, the number of state variables needed for the 
complete description of the system is never reached, and the sum of the deviations, 
or the quantities connected to them, continues to grow with the number of state 
variables considered [AAN+07]. The fluctuations of the system variables appear
random and follow the distributions of probability theory. 

The study of chaos and of the transitions from chaotic to stochastic states (and
vice versa) is a very recent research area, where many problems
remain unsolved. The interested reader can enter into this fascinating topic through
the introductory readings [AAN+07, Rue96, Ste97].

In the remainder of the book, we will not deal with chaos, but we will instead 
devote ourselves to the study of random or stochastic systems, that is, of all the 
systems in which, as we have previously noted, there are variables following, in
principle, the statement:⁴


² Referring to chaos in meteorological systems, it is often said: “a flap of butterfly wings in the
tropics can trigger a hurricane in Florida”.

³ Most of the original R routines start with a lowercase letter, ours with a capital letter.

⁴ In the following the non-mathematical operational definitions will be called “statements”.




Statement 1.1 (Random Variable in a Broad Sense) A stochastic, random or 
aleatory variable is the result of the interaction of many factors, each of which 
is not dominant over the others. These factors (and their dynamic laws) cannot be 
completely identified, fixed and in any case kept under control, not even in principle. 


In the present book, we will mainly use the term “random variable”. Let us now try 
to identify some stochastic systems or processes which in nature produce random 
variables. All many-body systems have a very high degree of randomness: the 
dynamic observables of molecular systems, ideal gases and thermodynamic systems 
generally follow Statement 1.1 very well. These are systems studied by statistical 
physics. 

At this point we can specify the meaning of the phrase “factors and dynamic laws
impossible to determine, not even in principle” we used in Statement 1.1. Suppose
we roll a die 100 times. To build a deterministic model that can predict the outcome
of the experiment, it would be necessary to introduce in the die's equations of motion
all the initial conditions of the toss, the constraints given by the surfaces of the
hands or the cup in which the die is shaken before throwing, the constraints given
by the table on which the die falls and perhaps more. We would thus have a
huge set of numbers, describing the initial conditions and constraints for each one
of the hundred tosses, enormously larger than the one hundred numbers giving the
final result of the experiment. Clearly, the predictive power of such a theory and its
practical applicability are totally absent. A deterministic model, to be such, must
be based on a compact set of equations and initial conditions and must be able to
predict a vast set of phenomena.

For example, this is the case of the logistic law (1.1) or of the simple law of
falling bodies, which connects the path length s to the gravitational acceleration g
and to the fall time t through the formula $s = g t^2/2$. This formula alone allows you
to predict, assigning s or t as the initial conditions, the results of any experiment.

We can summarize the above considerations by saying that a deterministic model
becomes meaningless when it generates algorithms requiring a numerical set of
initial conditions, constraints and equations enormously larger than the set of results
that one intends to predict. Alternatively, one should use the statistical approach
which, based on the a posteriori study of the results obtained, tries to quantify the
extent of the fluctuations and extract global regularities that can be useful for the
prediction of future results.

This line of thinking, developed during the last three centuries, arrived, by
studying pure stochastic systems, at identifying the fundamental mathematical
laws for the description of random phenomena. The set of these laws is now known
as probability theory.

All the books dealing with probability theory, including the present one, make
extensive use of examples taken from the games of chance, such as dice throwing.
These examples delineate well the essence of the problem, because only by
studying pure stochastic systems is it possible to discover the laws of chance. Great
mathematicians and statisticians, like P. Fermat (1601-1665), P.S. Laplace (1749-1827)
and J. Bernoulli (1654-1705), often discussed experiments they performed with
dice, cards or other devices taken from games. One of their goals was precisely
to provide winning strategies for gambling games, which were already widespread
at that time and which they played too. In this way they laid the foundations of
probability calculus and statistics, based exclusively on experimental facts, as the
scientific method requires.

In addition to traditional games, today there is another “artificial” laboratory, 
consisting of computer-generated random processes. As we will see, it is indeed 
possible to simulate pure stochastic systems of any kind and complexity using a 
uniform random number generator (a kind of electronic roulette): rolls of the dice, 
card games, many-body physical systems, and more. 

These techniques, named Monte Carlo (recalling the homeland of the games of
chance) or simulation methods, are very practical and effective, because they allow
one to obtain artificial datasets in a few seconds, whereas a real experiment in some cases
would take years. However, it is important to note that, conceptually, these methods
do not introduce new elements. The aim is always to obtain random variables
from models consisting of stochastic systems also including, when necessary,
deterministic components. These data are then used to develop and optimize the
logical-mathematical tools to be applied to the study of real systems.

And now let us start to examine real systems in general. For example, consider
Fig. 1.2, which represents the average temperature of the earth’s surface over the
past 142 years. As you can well imagine, our future depends on the trend of
this curve in the next years. Comparing “by eye” this curve with that of Fig. 1.3,
representing a pure stochastic process, it seems that, starting from the beginning of
the last century, an increasing trend is superimposed on a random behaviour. We do
not go further into this rather alarming example, which just served to show that, in
real cases, the simultaneous presence of both stochastic and deterministic effects is
very common.


Fig. 1.2 Average global terrestrial surface temperature (°C, plotted against the year) for the period
1880-2021. The line at zero represents the average of the years 1951-1980 [Tea22, LSHT 90]


Fig. 1.3 Computer simulation of the number of heads obtained by throwing 10 coins in 120 tosses.
The progressive number of tosses is reported on the abscissas, the number of heads in ordinates.
The continuous line is the expected mean value (five heads). Compare this figure with Fig. 1.1 for
$\lambda = 3.8$, which displays chaotic fluctuations
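A pure stochastic sequence like the one of Fig. 1.3 can be generated with a few lines of R. The sketch below is only an illustration and not the code used for the figure: the seed and the variable names are arbitrary assumptions, and the built-in generator rbinom is used to count the heads in each toss of 10 fair coins.

```r
# Simulate 120 tosses of 10 fair coins and count the heads in each toss.
set.seed(1)                        # arbitrary seed, used only for reproducibility
n_tosses <- 120
n_coins  <- 10
heads <- rbinom(n_tosses, size = n_coins, prob = 0.5)

expected <- n_coins / 2            # expected mean value (five heads)
mean(heads)                        # sample mean, fluctuating around 5
table(heads)                       # how often each number of heads occurred
```

A call such as plot(heads, type = "l", xlab = "toss", ylab = "heads") followed by abline(h = expected) draws the sequence together with the expected mean value; repeating the run with a different seed gives a different sequence with the same global regularities.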

To account for these possible complications, the study of a real system is 
performed with a gradual approach, according to the following steps: 


(a) To identify the purely stochastic processes of the system and deduce, based on 
the rules of probability and statistics, their evolution laws. 

(b) To separate stochastic from non-stochastic components (sometimes called 
systematic), if any. This step is generally performed using statistical methods. 

(c) If the problem is particularly difficult, to perform a computer simulation of the
system and compare the simulated data with the real ones.


It is often necessary to repeat steps (a) to (c) until the simulated data are in a 
satisfactory agreement with the real ones. This recursive technique is a powerful 
method of analysis and is now applied in many fields of scientific research, from 
physics to economics. 

Before closing this introduction, we would like to mention what happens in the
microscopic world. Let us consider, for simplicity, a system consisting of a single
subatomic particle, such as an electron. In this case the fundamental equations of physics
provide a complex state function $\psi(r)$ whose square modulus gives the localization
probability of the particle in space: $P(r) = |\psi(r)|^2$. The probability thus defined
obeys the general laws of probability which will be described in the following.

Since the fundamental laws of the microscopic world contain a probability
function and so far no one has been able to find more basic fundamental laws
based on different quantities, one deduces that probability is a fundamental quantity
of nature. Indeterminism, being present in the fundamental laws that govern the
dynamics of the microscopic world, assumes in this case an objective character
(called non-epistemic), not linked to ignorance or limited abilities of the observer.


1.2 Some Basic Terms 


Here we informally introduce some fundamental definitions of current use in the 
study of stochastic phenomena. In the following, these terms will gradually be 
redefined in a mathematically rigorous way. 


Sample space: it is the set of all possible different values (cases) that a random
variable can assume. For example, the random variable card of a playing deck
gives rise to a sample space of 52 elements. The structure of the space depends
on the way used to define the random variable. In fact the space relative to the
random variable card of a playing deck consists of 52 cards, or of 52 integer
numbers if we create a correspondence between cards and numbers.

Event: it is a particular combination or a particular subset of cases. For example,
in the case of playing cards, if you define an event as an odd card, the set of cases
obtained is 1, 3, 5, 7, and 9, for each of the four suits. This event gives rise to
a subset of 20 elements selected among the 52 elements of the sample space (all
the cards in the deck).

Spectrum: it is the set of all the different elements of the subset of cases defining
the single event. For odd playing cards, the spectrum is given by 1, 3, 5, 7, and 9.
Obviously, the spectrum can coincide with the entire space of the random variable
under study (if, e.g., the event is defined as any card of a deck).

Probability: it is the quantitative evaluation of the possibility of obtaining a certain
event. It is evaluated based on experience, using mathematical models or even
on a purely subjective basis. For example, the probability that, at this point, you
continue reading our book is, in our opinion, 95% ...

Trial: it is the set of operations that realize the event. 

Experiment, measurement, sampling: it is a collection of trials. The term 
familiar to statisticians is sampling, whereas the physicists usually use the term 
experiment or measurement. In physics an experiment can be a sampling, but not 
necessarily. 

Sample: it is the result of an experiment or sampling. 

Population: it is the result of that number of trials, finite or infinite, which run
through all the possible events. For example, in the lottery game, the population
can be the finite set of all possible combinations of 5 numbers drawn from an
urn of 90 numbers; in the case of the height of the Italians, we can imagine the
set of measurements of the heights of each individual. When the population is
thought of as a sample of an infinite number of elements, it should be considered as
a mathematical abstraction not achievable in practice.
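The terms above can be made concrete with a small R sketch based on the playing-card example. The coding of the deck (values 1 to 13, with 1 for the ace, and four suit names) and all variable names are assumptions made here for illustration only.

```r
# Sample space: the 52 cards of a deck, coded by value (1 = ace, ..., 13 = king) and suit.
suits <- c("hearts", "diamonds", "clubs", "spades")
deck  <- expand.grid(value = 1:13, suit = suits, stringsAsFactors = FALSE)
nrow(deck)                                    # 52 elements of the sample space

# Event "odd card" as defined in the text: values 1, 3, 5, 7, 9 in each suit
odd_event <- deck[deck$value %in% c(1, 3, 5, 7, 9), ]
nrow(odd_event)                               # 20 elements
spectrum <- sort(unique(odd_event$value))     # 1 3 5 7 9

# Trial: one random draw; sample: the result of an experiment of 10 trials
trial    <- deck[sample(nrow(deck), 1), ]
sample10 <- deck[sample(nrow(deck), 10, replace = TRUE), ]
```

Here each trial puts the card back in the deck (replace = TRUE), so that all trials are performed under the same conditions; this is a choice of the sketch, not a requirement stated in the text.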


These ideas can be summarized as in Fig. 1.4. Once the elementary probabilities 
have been assigned to the elements of the sample space (inductive step), using 
probability theory one can calculate the probability of all events, thus deducing 
mathematical models for the population (deductive step). Instead, by running a 
series of measurements, one can get a sample of events (experimental spectrum) 
representative of the population under consideration. Then, using the statistical data 
analysis (inductive/deductive step), one tries to identify, from a detailed examination 
of the sample, the properties of the parent population. These techniques are called 
statistical inference. Once a model has been assumed, it is possible to verify 
its congruence with the collected data samples. This method is called hypothesis 
testing. 

In this text, the fundamentals of probability calculus will be at first explained with
particular regard to the assignment of elementary probabilities to the components of
the sample space. Then, calculus and combinatorial analysis will be used to obtain
some fundamental mathematical models of populations. Afterwards, the methods of
statistical analysis will be explained. They allow one to estimate, starting from measured
quantities, the “true” values of physical parameters or to verify the congruity of
experimental samples with mathematical models of population. The elements of
probability and statistics previously acquired will then be extensively applied to
simulation techniques.


Fig. 1.4 The relationships between probability calculus, statistics and measurement (diagram
labels: sample space, event (ensemble of cases), population, trial, probability calculus, statistics,
measurement)


1.3 The Concept of Probability


Experience shows that, when a stochastic or random phenomenon is stable over
time, some values of the spectrum occur more frequently than others. If we flip ten
coins and count the number of heads, we see that the outcome of five heads occurs
more frequently than eight, while ten heads is a really rare, almost impossible, event.
If we consider an experiment consisting of 100 trials (where each trial is the toss of
10 coins), we observe that the number of times one gets 5, 8 and 10 heads is quite
regular, albeit with small variations from experiment to experiment, because the
values 5, 8 and 10 always (or almost always) show up with decreasing frequency.
If we imagine all the possible alignments of 10 coins, we can have an intuitive
explanation of this fact: the event 10 heads (or 10 tails) corresponds to only one
alignment, while for the event 5 tails (or 5 heads) many alignments
are possible (5 tails and then 5 heads, tails and heads alternately, and so on up to
252 different alignments). When tossing ten coins, we then choose at random, on
the same footing, one of the possible alignments, and it is intuitive that almost
always we will get balanced results (more or less five heads) and almost never the
extreme cases. A reasoning of this type, common to everyone’s daily experience,
leads instinctively to think that this regularity of the stochastic phenomena is due
to the existence of fixed quantities, called probabilities, that one can define, for
example, as the ratio between favourable and possible cases (alignments). These
considerations led J. Bernoulli to the formulation of the first mathematical law able
to predict the trend of the results in experiments such as the coin toss, taking also
into account the random fluctuations.
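The counting of alignments can be checked directly in R. The sketch below is only an illustration with variable names chosen here: it uses the binomial coefficient choose(n, k) to count the alignments of 10 coins with k heads and takes the ratio between favourable and possible cases.

```r
n_coins <- 10
k <- 0:n_coins
alignments <- choose(n_coins, k)      # number of alignments with k heads
total <- 2^n_coins                    # 1024 alignments, all on the same footing
prob <- alignments / total            # favourable cases / possible cases

alignments[k == 5]                    # 252
alignments[k == 8]                    # 45
alignments[k == 10]                   # 1
round(prob[k %in% c(5, 8, 10)], 4)    # about 0.2461, 0.0439, 0.0010
```

The decreasing values for 5, 8 and 10 heads reproduce the ordering of the frequencies described above.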

In the case of coins, the probability is introduced to account for the variability
of experimental results; however, each of us uses probability also to manage the
uncertainty of many non-repeatable situations that occur in real life, quantifying
subjectively the realistic possibilities and choosing those with the highest probability,
taking into account the resulting costs or benefits.

For example, when we are driving the car and we meet a red traffic light, we
have two options: stop or continue. If, around noon, we are crossing a high-traffic
road, we surely stop, because we know, based on our experience, that the collision
probability with other vehicles is very high. Instead, if we are on a low-traffic road
in the middle of the night, we are tempted to continue, because we know that the
probability of a collision is very low.

Another example of a subjective and discrete probability is given by the
judgement of a defendant in a trial by a jury. In this case, the probability can be
expressed with two values, 0 or 1, i.e. guilty or innocent. In general, current
jurisprudence formulates the final judgement combining subjective individual probabilities
expressed by the individual jurors.

Given these observations, the approach currently considered more appropriate, 
effective and ultimately cheaper for the study of random phenomena is to consider 
the choice of probability as a subjective act, based on experience. A first possible 
effective definition of probability is: 


Statement 1.2 (Subjective or Bayesian Probability) The probability is the
subjective degree of belief about the occurrence of an event.


The subjective probability is free, but it is generally assumed that it must be
consistent, that is, expressed as a real number $0 \le p \le 1$, with $p = 1$ for a certain
event and $p = 0$ for an impossible event. Then, considering two or more mutually exclusive
events (like the faces 2 or 4 on a die roll), consistency requires their probabilities
to be additive. These assumptions are sufficient for the axiomatization according to
the Kolmogorov scheme, which will be presented shortly.

The subjective probability is widely used in soft sciences such as jurisprudence,
economics, part of medicine, etc. In hard sciences such as physics (we will specify better
later, in Chap. 12, the meaning of the term “hard science”), the subjective probability
is generally avoided and the definitions of a priori and frequentist probabilities are
used (Laplace, 1749-1827) (Von Mises, 1883-1953).


Definition 1.3 (Classical or a Priori Probability) If N is the total number of cases
of the sample space of a random variable and n is the number of cases with outcome
A, the classical or a priori probability of A is given by:


\[ P(A) = \frac{n}{N} . \tag{1.2} \]


For example, the a priori probability of a given face when throwing a fair die is:


\[ P(A) = \frac{n}{N} = \frac{\text{number of favorable cases}}{\text{number of possible cases}} = \frac{1}{6} , \]


while the probability of drawing the ace of diamonds from a deck of cards is 1/52,
the probability of drawing a diamonds card is 1/4 and so on.


Definition 1.4 (Frequentist Probability) If m is the number of occurrences of
outcome A over a total of M trials, the probability of A is given by:


\[ P(A) = \mathop{\mathit{lim}}_{M \to \infty} \frac{m}{M} . \tag{1.3} \]


The limit appearing in this definition has an experimental meaning rather than a
mathematical one, because the true probability could be found only by carrying
out an infinite number of trials. In the following, we will call this operation, with
the limit written in italics, the frequentist limit.
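The experimental meaning of Eq. (1.3) can be explored with a simulated die. The R sketch below is only an illustration: the seed, the number of trials and the variable names are arbitrary assumptions. It follows the relative frequency m/M of face 3 as the number of trials M grows.

```r
set.seed(2)                                  # arbitrary seed, used only for reproducibility
M <- 100000                                  # total number of trials
rolls <- sample(1:6, M, replace = TRUE)      # M throws of a fair die
m <- cumsum(rolls == 3)                      # occurrences of face 3 after each trial
freq <- m / seq_len(M)                       # running relative frequency m/M

freq[c(10, 100, 1000, 10000, 100000)]        # values at increasing M
abs(freq[M] - 1/6)                           # residual fluctuation at finite M
```

The frequency approaches 1/6 without ever reaching it exactly at finite M, which is why the limit in (1.3) cannot be treated as an ordinary mathematical limit.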




The choice of the elementary probabilities to be assigned to the different events 
is therefore inductive and arbitrary. The probability calculus applied to complex 
events starts from arbitrarily assigned elementary probabilities and then proceeds 
deductively, as we shall see, without departing from mathematical rigor. The use of 
subjective probabilities is also called Bayesian approach, because in this case the 
initial probabilities are often readjusted according to the results obtained using the 
famous Bayes’ formula, which we will soon deduce in Sect. 1.7. 

The frequentist approach is the one prevalent in physical and technical frameworks.
Based on our experience, we believe that in experimental physics the
frequentist approach is followed in 99% of cases, and this is a ... subjective
evaluation! Within this framework, it is believed that Eq. (1.3) allows the “objective”
evaluation of probability for those natural phenomena that can be easily sampled
or easily repeated in the laboratory. In many cases, experience shows that the
frequentist probability tends to coincide with the a priori one:


\[ \mathop{\mathit{lim}}_{M \to \infty} \frac{m}{M} = \frac{n}{N} \quad \text{(from the experience!)} . \tag{1.4} \]


When this condition holds, one says that the cases are equiprobable and mutually
exclusive. Consider, for example, the roll of a die: if you are sure that it is not
rigged, it is intuitive to assume that the probability of getting a certain face (let’s say,
the number 3) in a throw is equal to 1/6. Experience shows that, after several throws,
the frequentist probability (also called frequency limit, Eq. (1.3)) actually tends to
1/6, according to (1.4). If the die is not balanced, the probability of obtaining face
number 3 can only be evaluated by running many trials. Since the limit for an infinite
number of trials is not practically reachable, one usually stops at a high but finite
number n of trials and the true probability is estimated by the confidence interval
method (see Chap. 6).

The frequentist definition (1.3) would therefore seem the most general and 
reliable; however, this is not true: 


• Since an experiment cannot be repeated an infinite number of times, the
probability (1.3) will never be determined.

• The experiment must be repeatable, and the limit appearing in (1.3) does not
have a precise mathematical sense. This leads to insurmountable mathematical
inconsistencies in proving the validity of the empirical law of chance (1.4).


The statistician B. de Finetti, in one of his famous articles [DF33], comments on this 
last point as follows: “... for a large category of the problems for probability theory 
(but not for all, as it is shown by the absurdities found and by the ones which could 
easily be found), by imagining an infinite sequence of similar experiences, one can 
build up an example of a possible course of results in a way as to obtain a limit 
frequency equal to probability, for each sequence of similar events.” 

The decision on the best approach to use (subjective-Bayesian, a priori-classical 
or frequentist), based on the type of problem to be addressed (uncertainty in a broad 
sense or variability of the results of repeatable experiments), is still an open question 
and is a continuous source of disputes. 




To definitively get out of this confused situation, modern probability theory
resorts to axiomatization. In the next section, we will see in fact that, after
defining the probability in an abstract mathematical way, it is possible to outline a
consistent mathematical theory for the study of random phenomena. The inductive
and arbitrary approach is limited to the initial decision about what probability
to adopt: once the choice of a probability that obeys the required axioms is
made, this theory can be applied correctly. Then, if the obtained results are in
disagreement with the experimental outcomes, it will be necessary to change the
type of probability to be used for that problem. For example, it is perfectly possible
to invent a probability that, in a lottery, assigns a higher probability to the overdue
numbers. If this probability obeys the axioms, the approach is mathematically
correct. However, in this case you will always get results totally different from those
observed. Therefore, in fair games, as well as in statistical physics, the assumed
probabilities are classical and frequentist, which leads to results in accordance with
experience.

This book, which is dedicated to students and researchers in technical-scientific 
fields, is based on the frequentist approach. However, we will mention in some cases 
even the Bayesian point of view, referring the reader to more specific texts, such as 
[Gre06]. 


1.4 Axiomatic Probability 


To formalize in a mathematically correct way the concept of probability, it is
necessary to apply set theory to the fundamental notions introduced so far. If
S is the sample space of a random variable, we consider the family $\mathcal{F}$ of subsets of
S according to the following definition.


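Definition 1.5 ($\sigma$-algebra) Any collection $\mathcal{F}$ of subsets of S having the properties:


(a) the empty subset belongs to $\mathcal{F}$: $\emptyset \in \mathcal{F}$;
(b) if a countable collection of subsets $A_1, A_2, \ldots \in \mathcal{F}$, then

\[ \bigcup_{i=1}^{\infty} A_i \in \mathcal{F} ; \]

(c) if $A \in \mathcal{F}$, the same holds for the complement: $\bar{A} \in \mathcal{F}$;

is named $\sigma$-algebra.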


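Using the well-known properties (De Morgan's laws and the definition of the difference of two sets):


\[ \overline{A \cup B} = \bar{A} \cap \bar{B} , \qquad \overline{A \cap B} = \bar{A} \cup \bar{B} , \qquad A - B = A \cap \bar{B} , \]


it is easy to show that also the intersection of a countable collection of sets belonging
to $\mathcal{F}$ and the difference $A - B$ of two subsets of $\mathcal{F}$ are included in $\mathcal{F}$:


\[ \bigcap_{i=1}^{\infty} A_i \in \mathcal{F} , \tag{1.5} \]
\[ A - B \in \mathcal{F} . \tag{1.6} \]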


The correspondence between probability and set theories is summarized in
Table 1.1. If, to fix ideas, we consider a deck of cards and we define the draw
of an ace as event A and the extraction of a diamonds card (Fig. 1.5) as event B, we
get the following correspondence between sets (events) and elements of S:


- $S$: all the 52 playing cards;
- $a$: one of the 52 playing cards;
- $A \cup B$: a diamonds card or the ace of hearts, clubs or spades;
- $A \cap B$: the ace of diamonds;
- $A - B$: the ace of hearts, clubs or spades;
- $\bar{A}$: any card except the aces;
- $\bar{B}$: any non-diamonds card.
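The correspondence between events and sets listed above can be reproduced with the set functions of base R (union, intersect, setdiff). In the sketch below the cards are coded as character strings such as "ace of diamonds"; this coding and the variable names are assumptions made only for illustration.

```r
# Sample space S: 52 cards coded as strings, e.g. "ace of diamonds"
values <- c("ace", 2:10, "jack", "queen", "king")
suits  <- c("hearts", "diamonds", "clubs", "spades")
S <- as.vector(outer(values, suits, paste, sep = " of "))

A <- S[grepl("^ace ", S)]            # event A: an ace is drawn
B <- S[grepl("diamonds$", S)]        # event B: a diamonds card is drawn

union(A, B)                          # A or B: 16 cards
intersect(A, B)                      # A and B: "ace of diamonds"
setdiff(A, B)                        # A - B: ace of hearts, clubs, spades
setdiff(S, A)                        # complement of A: the 48 cards that are not aces
setdiff(S, B)                        # complement of B: the 39 non-diamonds cards
```

The same three functions are sufficient to check the properties (1.5) and (1.6) on any finite collection of events.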


Table 1.1 Correspondence between probability and set theories


Symbol          Set theory                                        Probability theory
$S$             Total set                                         Sample space
$a$             An element of S                                   Result of a trial
$A$             Subset of S                                       If $a \in A$ the event A occurs
$\emptyset$     Empty set                                         No events occur
$\bar{A}$       Collection of elements of S not belonging to A    The event A does not occur
$A \cup B$      Elements belonging either to A or to B            The events A or B occur
$A \cap B$      Elements that belong to both A and B              The events A and B occur
$A - B$         Elements of A not belonging to B                  The event A occurs, but B does not
$A \subseteq B$ The elements of A belong also to B                If A occurs, B also occurs

Fig. 1.5 The random variable “playing card” (52 cards) and the events “extraction of an
ace” and “extraction of a diamonds suit” according to set theory




Let us now consider a function P(A), for A belonging to a $\sigma$-algebra $\mathcal{F}$, that maps
the set A to a real number in the range [0, 1]. In symbols,


\[ P : \mathcal{F} \to [0, 1] . \tag{1.7} \]


According to the Kolmogorov approach, the probability follows the 


Definition 1.6 (Kolmogorov Axiomatic Probability) A function P(A) satisfying
(1.7) and the properties:


\[ P(A) \ge 0 ; \tag{1.8} \]
\[ P(S) = 1 ; \tag{1.9} \]

and,

\[ P\left( \bigcup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty} P(A_i) \quad \text{if } A_i \cap A_j = \emptyset \ \ \forall\, i \ne j , \tag{1.10} \]

for any countable collection $A_1, A_2, \ldots$ of mutually disjoint subsets included in $\mathcal{F}$,
is called probability.


Definition 1.7 (Probability Space) The probability triplet:

\[ \mathcal{E} = (S, \mathcal{F}, P) , \tag{1.11} \]

composed of the sample space S, a $\sigma$-algebra $\mathcal{F}$ and P is named probability space.


The Kolmogorov probability satisfies the following important properties:


\[ P(\bar{A}) = 1 - P(A) , \tag{1.12} \]
\[ P(\emptyset) = 0 , \tag{1.13} \]
\[ P(A) \le P(B) \quad \text{if } A \subseteq B . \tag{1.14} \]


Equation (1.12) is valid since the complement $\bar{A}$ is such that by definition $A \cup \bar{A} = S$;
therefore, $P(A) + P(\bar{A}) = P(S) = 1$ from (1.9, 1.10), since $A$ and $\bar{A}$ are disjoint.
Moreover:


\[ P(S \cup \emptyset) = P(S) = 1 \quad \text{from (1.9)} , \tag{1.15} \]
\[ P(S \cup \emptyset) = P(S) + P(\emptyset) = 1 \quad \text{from (1.10)} , \tag{1.16} \]


from which Eq. (1.13) follows: $P(\emptyset) = 1 - P(S) = 1 - 1 = 0$. Finally, when
$A \subseteq B$ one can write $B = (B - A) \cup A$, where $B - A$ is the set of the elements of
B not in A. Then:

\[ P(B) = P[(B - A) \cup A] = P(B - A) + P(A) \]

and, since $P(B - A) \ge 0$, the property (1.14) is also proved.
Another important proposition is: 


Theorem 1.1 (of Addition) The probability of the event given by the occurrence
of the events A or B, when $A \cap B \ne \emptyset$, is given by:


\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) . \tag{1.17} \]

Proof It is easy to show that (you can draw the sets):

\[ A \cup B = A \cup [B - (A \cap B)] , \]
\[ B = [B - (A \cap B)] \cup (A \cap B) ; \]

since $A \cup B$ and $B$ are thus written as unions of disjoint sets, it is possible to apply Eq. (1.10) to obtain:

\[ P(A \cup B) = P(A) + P[B - (A \cap B)] , \]
\[ P(B) = P[B - (A \cap B)] + P(A \cap B) . \]

Then, one gets, by subtraction:

\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) . \]

□


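Both classical and frequentist probabilities follow the axioms (1.8)-(1.10). In fact, for
the classical probability, we have:


\[ P(A) = \frac{n_A}{N} \ge 0 \quad \text{always, because } n_A \ge 0 ,\ N > 0 , \]
\[ P(S) = \frac{N}{N} = 1 , \]
\[ P(A \cup B) = \frac{n_A + n_B}{N} = \frac{n_A}{N} + \frac{n_B}{N} = P(A) + P(B) , \]


where the last relation holds for mutually exclusive events A and B. Similarly, the validity
of the axioms can also be proved for the frequentist
probability, since its limit can be considered as a linear operator.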

The classical and frequentist probabilities previously defined therefore satisfy
the properties (1.8-1.17). For example, the classical probability of drawing an ace or a
red card from a deck of 52 cards, based on (1.17), is given by:

A = {ace of hearts, ace of diamonds, ace of clubs, ace of spades},
B = {13 diamonds cards, 13 hearts cards},
$A \cap B$ = {ace of hearts, ace of diamonds},

\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) = 4/52 + 1/2 - 2/52 = 7/13. \]
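The count above can be checked by brute force. The following R lines are only a minimal sketch, assuming a standard 52-card deck; the variable names (`deck`, `is_ace`, `is_red`) are illustrative and are not routines from this book:

```r
# Build a 52-card deck as a data frame (13 ranks x 4 suits).
ranks <- c("A", 2:10, "J", "Q", "K")
suits <- c("hearts", "diamonds", "clubs", "spades")
deck  <- expand.grid(rank = ranks, suit = suits, stringsAsFactors = FALSE)

is_ace <- deck$rank == "A"                        # event A
is_red <- deck$suit %in% c("hearts", "diamonds")  # event B

p_A   <- mean(is_ace)           # 4/52
p_B   <- mean(is_red)           # 26/52
p_AB  <- mean(is_ace & is_red)  # 2/52
p_AuB <- mean(is_ace | is_red)  # direct count of "A or B"

p_add <- p_A + p_B - p_AB       # right-hand side of Eq. (1.17)
c(direct = p_AuB, addition = p_add, expected = 7/13)
```

Both the direct count and the addition formula should agree with 7/13, because with equally likely cards a classical probability is just the fraction of favourable rows.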


The probability associated with the set $A \cap B$ plays, as we will see, a particularly
important role in the algebra of probability. It is called compound probability:

Definition 1.8 (Compound Probability) The compound probability

\[ P(A \cap B) \quad \text{or} \quad P(AB) \]

is the probability that events A and B both occur.


Now we introduce a new kind of probability. Suppose we are interested in the
probability that, once a diamond has been drawn, the card is an ace or that, when
an ace is drawn, the suit is diamonds. We denote by A the set of aces, by B that of
the diamond cards and by P(A|B) the probability of A given that B has occurred, that is,
the probability that the card is an ace once a diamond is drawn. Obviously, we have:

\[ P(A|B) = \frac{\#(\text{outcomes of the ace of diamonds})}{\#(\text{outcomes of the diamonds suit})}
          = \frac{1}{13} = \frac{1}{52} \Big/ \frac{13}{52} = \frac{P(A \cap B)}{P(B)}. \tag{1.18} \]

Similarly, the probability of getting a diamond if an ace is drawn is given
by:

\[ P(B|A) = \frac{\#(\text{outcomes of the ace of diamonds})}{\#(\text{outcomes of an ace})}
          = \frac{1}{4} = \frac{1}{52} \Big/ \frac{4}{52} = \frac{P(B \cap A)}{P(A)}. \]


In the example just seen, the conditional probability P(A|B) of getting an ace once a
diamond is drawn is equal to the unconditional probability P(A) of drawing
an ace; indeed:

\[ P(A|B) = \frac{1}{13} = P(A) = \frac{4}{52}. \]

In this case, we say that the events A and B are independent. However, if A is the
set [ace of diamonds, ace of spades] and B is, as before, the set of diamonds cards,
we have:

\[ P(A|B) = \frac{1}{13} \ne P(A) = \frac{2}{52} = \frac{1}{26}. \]

We see that events A and B are now dependent, because, if one draws a diamond,
the probability of A is modified. However, Eq. (1.18) is also valid in this case:

\[ P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{1}{52} \cdot \frac{52}{13} = \frac{1}{13}. \]
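The two card examples can be reproduced by counting inside the conditioning set. This is a hedged sketch in R: the helper `cond_prob()` is a hypothetical name introduced here for illustration, and the deck layout is an assumption.

```r
# 52-card deck (13 ranks x 4 suits).
ranks <- c("A", 2:10, "J", "Q", "K")
suits <- c("hearts", "diamonds", "clubs", "spades")
deck  <- expand.grid(rank = ranks, suit = suits, stringsAsFactors = FALSE)

# Classical conditional probability: favourable cases counted within B only.
cond_prob <- function(A, B) sum(A & B) / sum(B)

B  <- deck$suit == "diamonds"                       # diamonds suit
A1 <- deck$rank == "A"                              # all four aces
A2 <- A1 & deck$suit %in% c("diamonds", "spades")   # only two aces

indep_case <- c(cond = cond_prob(A1, B), uncond = mean(A1))  # 1/13 and 4/52
dep_case   <- c(cond = cond_prob(A2, B), uncond = mean(A2))  # 1/13 and 2/52
p_B_given_A <- cond_prob(B, A1)                              # 1/4
```

In the first case the conditional and unconditional values coincide (independent events); in the second they differ, and in both cases P(A|B) is not equal to P(B|A).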


These examples suggest the following. 


Definition 1.9 (Conditional Probability) The conditional probability of B given
A is the quotient of the probability of the occurrence of A and B and the probability
of A:

\[ P(B|A) = \frac{P(A \cap B)}{P(A)} \quad \text{if } P(A) > 0. \tag{1.19} \]


It is easy to show (this is left as an exercise) that the definition of conditional
probability (1.19) is in agreement with the general axioms of Kolmogorov (1.8-1.10).
It is also important to note that, in general,

\[ P(A|B) \ne P(B|A), \tag{1.20} \]

a fact that appears obvious from the examples just made but that often does not
appear obvious to our logical-intuitive abilities. Failure to comply with Eq. (1.20)
is perhaps the source of most of the errors made when dealing with
probabilities. The crucial point is that the correct connection between the two
probabilities is possible only through Bayes’ theorem, as we will see shortly. On
this point we recommend Problems 1.16 and 1.17 and also reading about the
so-called Sally Clark case (see, e.g. [Wik22]).

We also note that the conditional probability has been introduced as a definition. 
However, for the probabilities we are dealing with, the following property holds. 


Theorem 1.2 (Product of Probabilities) In the classical and frequentist
frameworks, the probability of the event formed by the occurrence of both A and B is:

\[ P(A \cap B) = P(A|B) P(B) = P(B|A) P(A). \tag{1.21} \]

Proof For the classical probability, if N is the total number of cases and $n_{AB}$ that
of the favourable ones to both A and B, we have:

\[ P(A \cap B) = \frac{n_{AB}}{N} = \frac{n_{AB}}{n_B} \frac{n_B}{N} = P(A|B) P(B), \]

since, by definition, $n_{AB}/n_B = P(A|B)$. This property obviously continues to hold
by exchanging A and B, hence Eq. (1.21).

For the frequentist probability, the proof is analogous if one replaces the number of
cases with that of trials. □




In the previous examples, we have introduced the notion of independent events; in
a general way, we can adopt the

Definition 1.10 (Independent Events) Two events A and B are independent if

\[ P(A \cap B) = P(A) P(B). \]

More generally, the events of a family $(A_i,\ i = 1, 2, \ldots, m)$ are independent if

\[ P\Big( \bigcap_{i \in J} A_i \Big) = \prod_{i \in J} P(A_i) \tag{1.22} \]

for any subset J of different indices of the family.

From Eq. (1.19) it follows that for independent events P(A|B) = P(A) and
P(B|A) = P(B). Another useful definition is:


Definition 1.11 (Incompatible Events) Two events are incompatible or disjoint
when the condition

\[ A \cap B = \emptyset \]

holds. From Eqs. (1.13) and (1.19) we then have:

\[ P(A \cap B) = 0, \quad P(A|B) = P(B|A) = 0. \]

For example, if A is the ace of spades and B the suit of diamonds, A and B are
incompatible events. According to these definitions, the essence of the probability
calculus can be summarized in the following formulae:

• For incompatible events:

\[ P(A \text{ or } B) = P(A \cup B) = P(A) + P(B). \tag{1.23} \]

• For independent events:

\[ P(A \text{ and } B) = P(A \cap B) = P(AB) = P(A) \cdot P(B). \tag{1.24} \]


1.5 Repeated Trials 


Up to now we have considered experiments performed with one single trial.
However, often one has to deal with experiments consisting of many trials: two
cards drawn from a deck, the score obtained rolling five dice and so on. We address
this problem by considering two repeated trials because the generalization to any
finite number of trials is obvious, as we shall see later.

Two repeated trials can be considered as the realization of two events related to
two experiments $(S_1, \mathcal{F}_1, P_1)$ and $(S_2, \mathcal{F}_2, P_2)$ which satisfy Definitions 1.6 and 1.7.
It is therefore natural to define a new sample space $S = S_1 \times S_2$ as a Cartesian
product of the two initial sample spaces, in which a single event is constituted by the
ordered pair $(x_1, x_2)$, where $x_1 \in S_1$ and $x_2 \in S_2$, and the new space S contains $n_1 n_2$
elements, if $n_1$ and $n_2$ are the numbers of elements of $S_1$ and $S_2$, respectively. For example, [ace
of hearts, queen of clubs] is an element of the set S of the probability space relative
to the extraction of two cards from a deck. Note that the Cartesian product can also
be defined when $S_1$ and $S_2$ are the same sample space.

Using the definition of events, and since $A_1 \subseteq S_1$ and $A_2 \subseteq S_2$, it is easy to realize
that:

\[ A_1 \times A_2 = (A_1 \times S_2) \cap (S_1 \times A_2). \tag{1.25} \]


The next step is now to define a probability P in $S_1 \times S_2$, which satisfies the axioms
of Kolmogorov (1.8-1.10) and can be associated in a unique way with experiments
consisting of repeated trials. Equation (1.24), which is valid for independent events,
and Eq. (1.25) allow us to write:

\[ \begin{aligned}
P(A_1 \times A_2) &= P[(A_1 \times S_2) \cap (S_1 \times A_2)] \\
                  &= P(A_1 \times S_2 \,|\, S_1 \times A_2)\, P(S_1 \times A_2) \\
                  &= P(A_1 \times S_2)\, P(S_1 \times A_2) \\
                  &= P_1(A_1)\, P_2(A_2), \qquad A_1 \in \mathcal{F}_1,\ A_2 \in \mathcal{F}_2,
\end{aligned} \tag{1.26} \]

where the last equality is valid because the set of pairs $A_k \times S_j$ in
the sample space $S_k \times S_j$ obviously has the same probability as the event $A_k$ in the
sample space $S_k$. The probabilities of the events $A_1 \subseteq S_1$ and $A_2 \subseteq S_2$ can therefore
be computed in the space S using the equalities:

\[ \begin{aligned}
P(A_1 \times S_2) &= P_1(A_1)\, P_2(S_2) = P_1(A_1), \\
P(S_1 \times A_2) &= P_1(S_1)\, P_2(A_2) = P_2(A_2),
\end{aligned} \tag{1.27} \]

which are obvious both for classical and frequentist probabilities. For example, in
the drawing of two playing cards, the probabilities of the events $A_1$ = [draw an
ace the first time] and $A_1 \times S_2$ = [ace, any card] are equal, like those of the events
$A_2$ = [extraction of a diamond the second time] and $S_1 \times A_2$ = [any card, suit
of diamonds].




As we said, Eq. (1.26) is only considered valid for independent events, for which, 
based on (1.24), the occurrence of any event does not alter the probability of the 
others. 

To better fix ideas with an example, suppose we draw two cards from a playing
deck (with replacement of the first card into the deck after the first draw) and let
$A_1$ be the set of aces and $A_2$ the set of diamonds. Equation (1.27) becomes:

\[ P_1(A_1) = \frac{4}{52} = P(A_1 \times S_2) = \frac{4}{52} \cdot \frac{52}{52}, \]
\[ P_2(A_2) = \frac{13}{52} = P(S_1 \times A_2) = \frac{52}{52} \cdot \frac{13}{52}, \]

whereas Eq. (1.26) gives:

\[ P(A_1 \times A_2) = \frac{4}{52} \cdot \frac{13}{52}, \]

in agreement with the ratio between the number of favourable cases (4 · 13) and the
possible ones (52 · 52) in the sample space $S_1 \times S_2$.
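As a cross-check of Eqs. (1.26) and (1.27), the product space for two draws with replacement can be enumerated explicitly. This is only an illustrative sketch; the objects `cards` and `idx` are assumptions of this example:

```r
# One deck: rank 1 stands for the ace.
cards <- expand.grid(rank = 1:13,
                     suit = c("hearts", "diamonds", "clubs", "spades"),
                     stringsAsFactors = FALSE)

# Cartesian product S1 x S2: every ordered pair (first card, second card).
idx <- expand.grid(first = seq_len(52), second = seq_len(52))

ace_first      <- cards$rank[idx$first] == 1            # A1 x S2
diamond_second <- cards$suit[idx$second] == "diamonds"  # S1 x A2

p_joint <- mean(ace_first & diamond_second)             # P(A1 x A2)
p_marg  <- c(mean(ace_first), mean(diamond_second))     # Eq. (1.27)
c(joint = p_joint, product = prod(p_marg))
```

The joint probability obtained by counting pairs should coincide with the product of the two marginal probabilities, as Eq. (1.26) states for independent trials.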

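The family of sets $\mathcal{F}_1 \times \mathcal{F}_2 = \{A_1 \times A_2 : A_1 \in \mathcal{F}_1, A_2 \in \mathcal{F}_2\}$ is not in general
a σ-algebra, but it is possible to show that a single σ-algebra $\mathcal{F}_1 \otimes \mathcal{F}_2$ of subsets
of $S_1 \times S_2$ exists containing $\mathcal{F}_1 \times \mathcal{F}_2$ and that Eq. (1.26) allows the extension, in
a unique way, of the probability of each event $A \subseteq S_1 \times S_2$ from the family
$\mathcal{F}_1 \times \mathcal{F}_2$ to the product σ-algebra $\mathcal{F}_1 \otimes \mathcal{F}_2$ [GS92]. Therefore, we can write the
product probability space as:

\[ \mathcal{E} = \mathcal{E}_1 \otimes \mathcal{E}_2 = (S_1 \times S_2,\ \mathcal{F}_1 \otimes \mathcal{F}_2,\ P). \]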


An extension of (1.26) is used when the space $S_2$ cannot be defined in advance but
depends on the results of the previous experiment $\mathcal{E}_1$. A good example is given by
the Italian lottery, in which five numbers are drawn, without replacing them in the
box.

In the case of two trials, we can imagine the extraction of two playing cards: if
you reinsert the first card drawn into the deck, $S_2$ consists of 52 elements, and of 51
otherwise. Given these conditions, you need to define the space $S = S_1 \times S_2$ not as
a Cartesian product but as the set of all possible ordered pairs of the two initial sets,
as they result from each particular experiment. We can say that, in this situation,
event $A_2$ depends on event $A_1$ and generalize Eq. (1.26) as:

\[ P(A \times B) = P_2(B|A)\, P_1(A), \tag{1.28} \]

resulting in an extension of the product Theorem 1.2 (see also Eq. 1.21). It is
immediate to show that the a priori and frequentist probabilities satisfy Eq. (1.28).
The proof for the frequentist probability is identical to that of Theorem 1.2, whereas,
for the classical probability, it is required to redefine N as the number of possible pairs,
$n_{AB}$ as the number of favourable pairs and $n_A$ as the number of pairs in which, at the first
extraction, event A occurred.

At this point, to avoid confusion, it is important to distinguish between
independent experiments and independent events. The hypothesis of independent
experiments, which we will use throughout the text, is completely general and implies
that the experimental procedures that lead to the occurrence of any event are
independent of those which lead to the occurrence of all the other events. This
hypothesis has no connection with the number of elements of the sample space.

On the contrary, in the repeated trial scheme the events will be considered
dependent when the size of the i-th space $S_i$ depends on the (i − 1) trials carried
out previously. This is the only kind of dependency that one assumes, considering
repeated trials, when writing the conditional probability P(A|B). Let us consider
a simple example, where $(W_1 \times B_2)$ is the event which consists in the extraction,
from an urn containing two white and three black marbles, of one white marble and
one black marble in this order (without replacing the marble into the urn after the
drawing). In this case we have (with obvious meaning of the symbols):

$S_1$ = [W1, W2, B1, B2, B3],

$S_2$ = [set formed by 4 marbles, 2 white and 2 black ones
or 1 white and 3 black ones],

\[ S_1 \times S_2 = \left\{ \begin{array}{lllll}
\text{W1W2}, & \text{W2W1}, & \text{B1W1}, & \text{B2W1}, & \text{B3W1} \\
\text{W1B1}, & \text{W2B1}, & \text{B1W2}, & \text{B2W2}, & \text{B3W2} \\
\text{W1B2}, & \text{W2B2}, & \text{B1B2}, & \text{B2B1}, & \text{B3B1} \\
\text{W1B3}, & \text{W2B3}, & \text{B1B3}, & \text{B2B3}, & \text{B3B2}
\end{array} \right\} = 5 \times 4 = 20 \text{ elements}. \]


Since the marbles are not reinserted after the drawing, events such as W1W1, B2B2,
etc. are excluded. Now we define:

$W_1 \times S_2$ = [white marble, any marble],

$S_1 \times B_2$ = [any marble, black marble],

\[ W_1 \times B_2 = [\text{white marble, black marble}] = \left\{ \begin{array}{ll}
\text{W1B1}, & \text{W2B1} \\
\text{W1B2}, & \text{W2B2} \\
\text{W1B3}, & \text{W2B3}
\end{array} \right\} = 6 \text{ elements}. \]

In the situation of equally probable cases, the classical probability gives:

\[ P(W_1 \times B_2) = \frac{\text{six favourable cases}}{\text{twenty possible pairs}} = \frac{6}{20} = \frac{3}{10}. \]


So far we have used in a general way the definition of classical or a priori probability.
Now we note that the probabilities (1.27) of the events $W_1$ and $B_2$ are given by:

\[ P_1(W_1) = 2/5, \quad P_2(B_2|W_1) = 3/4, \]

because initially we have in the urn W1, W2, B1, B2, B3 (i.e. two white marbles
over a total number of five) and, in the second drawing, we have three black marbles
and one white marble in the urn if the first extracted marble was white, so there are
three black marbles over four. Now we apply Eq. (1.28) and obtain again:

\[ P(W_1 \times B_2) = P_2(B_2|W_1)\, P_1(W_1) = 3/4 \times 2/5 = 6/20 = 3/10, \]


in agreement with the direct calculation of the favourable cases over the total ones.
If we neglect the order of extraction and define as event the extraction of a white
and a black marble or vice versa, we must define the sets:

$W_1 \times S_2$ = [white marble, any marble],
$B_1 \times S_2$ = [black marble, any marble],
$S_1 \times B_2$ = [any marble, black marble],
$S_1 \times W_2$ = [any marble, white marble],

and apply Eq. (1.28):

\[ P[(B_1 \times W_2) \cup (W_1 \times B_2)] = P_2(W_2|B_1)\, P_1(B_1) + P_2(B_2|W_1)\, P_1(W_1)
   = \frac{2}{4} \cdot \frac{3}{5} + \frac{3}{4} \cdot \frac{2}{5} = \frac{3}{5}. \]

The result agrees with the ratio between favourable (12) and possible pairs (20).
The generalization of this scheme for a higher number of repeated trials requires
the natural extension of the equations discussed here for two trials only and does not
present any relevant difficulty.
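The urn example lends itself to a direct enumeration of the 20 ordered pairs. The lines below are a minimal sketch, assuming the marble labels W1, W2, B1, B2, B3 used in the text; all variable names are illustrative:

```r
marbles <- c("W1", "W2", "B1", "B2", "B3")

# All ordered pairs, then drop pairs repeating the same marble (no replacement).
pairs <- expand.grid(first = marbles, second = marbles, stringsAsFactors = FALSE)
pairs <- pairs[pairs$first != pairs$second, ]

first_white  <- substr(pairs$first, 1, 1) == "W"
second_white <- substr(pairs$second, 1, 1) == "W"

p_count   <- mean(first_white & !second_white)  # white then black, by counting
p_product <- (3/4) * (2/5)                      # Eq. (1.28): P2(B2|W1) P1(W1)
p_either  <- mean(xor(first_white, second_white))  # one white and one black, any order

c(count = p_count, product = p_product, either_order = p_either)
```

Counting gives 6 favourable pairs out of 20 for the ordered event and 12 out of 20 when the order is neglected, matching the values 3/10 and 3/5 obtained from Eq. (1.28).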


1.6 Elements of Combinatorial Analysis 


Assuming you are already familiar with the topic, we briefly summarize here the 
basic formulae of combinatorial analysis, which are often helpful in calculating 
probabilities by counting the number of possible or favourable cases. 




To count well, it must be kept in mind that the number of possible pairs (matches) 
A x B between two sets A of a elements and B of b elements is given by the product 
ab and that the number of possible permutations of n objects is given by the factorial 
n!. A selection or arrangement in which order is important is called a permutation; 
a selection in which order is neglected is called a combination. 

Based on these properties, four fundamental formulae can be easily demonstrated,
which refer to arrangements without repetition D(n, k) of n objects in
groups of k (using k of the objects at a time), to those with repetition D*(n, k)
and to combinations without and with repetition C(n, k) and C*(n, k), in which the
order of the k elements does not matter.

The formulae, as perhaps you already know, are:

\[ D(n, k) = n(n-1) \cdots (n-k+1), \tag{1.29} \]
\[ D^*(n, k) = n^k, \tag{1.30} \]
\[ C(n, k) = \frac{n(n-1) \cdots (n-k+1)}{k!} = \frac{n!}{k!\,(n-k)!} = \binom{n}{k}, \tag{1.31} \]
\[ C^*(n, k) = \frac{(n+k-1)!}{k!\,(n-1)!} = \binom{n+k-1}{k}, \tag{1.32} \]

where the binomial coefficient formula has been used. 

To understand these formulae, just imagine the group of k objects as the
Cartesian product of k sets. In D(n, k) the first set contains n elements; the second
set contains n − 1 elements because the first element is excluded, and so on until you get,
after k times, a set of (n − k + 1) elements. Instead, if the repetitions in the group of k objects
are allowed, all the sets will contain n elements each; hence, we obtain Eq. (1.30).
The base n number system is just a D*(n, k) arrangement: if, for instance, n =
10, k = 6, we have $10^6$ numbers, from 000,000 to 999,999.

In Eq. (1.31), where C(n, k) = D(n, k)/k!, the k! groups containing the same k objects
in a different order are counted only once, because in this case the order does not matter.

Finally, to obtain Eq. (1.32) one has to imagine writing, for instance, a
combination C*(n, 5) such as $a_1 a_2 a_2 a_2 a_7$ in a new way: $a_1 * a_2 * * * a_3 a_4 a_5 a_6 a_7 *$, where
any element is followed by a number of asterisks equal to the number of times of
its occurrence; it is easy to verify that there is a one-to-one correspondence between
the original combinations and all possible permutations in the alignment of letters
and asterisks in the alternative representation. Since each alignment starts with $a_1$,
it is possible to permute in total n − 1 + k objects, that is, k asterisks and n − 1
elements ($a_i$ with i = 2, ..., n) equal to each other, obtaining Eq. (1.32).
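The four counting formulae can be verified on a small case by listing every ordered k-tuple and filtering. This is only a brute-force sketch with illustrative values n = 5, k = 3 (an assumption of this example), using base R functions:

```r
n <- 5; k <- 3

D      <- factorial(n) / factorial(n - k)  # Eq. (1.29): n(n-1)...(n-k+1)
D_star <- n^k                              # Eq. (1.30)
C      <- choose(n, k)                     # Eq. (1.31)
C_star <- choose(n + k - 1, k)             # Eq. (1.32)

# Brute force: all ordered k-tuples of the objects 1..n, one per row.
tuples <- as.matrix(expand.grid(rep(list(1:n), k)))

no_repeat      <- apply(tuples, 1, function(x) length(unique(x)) == k)
increasing     <- apply(tuples, 1, function(x) all(diff(x) > 0))   # one ordering per combination
non_decreasing <- apply(tuples, 1, function(x) all(diff(x) >= 0))  # repetitions allowed

rbind(formula = c(D = D, D_star = D_star, C = C, C_star = C_star),
      counted = c(sum(no_repeat), nrow(tuples), sum(increasing), sum(non_decreasing)))
```

Keeping only strictly increasing tuples selects one representative per combination without repetition, while non-decreasing tuples do the same when repetitions are allowed; the two rows of the printed table should coincide.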

In R, it is possible to calculate n! with the routine factorial(n) and
the binomial coefficients (1.31) with the routine choose(n,k). Moreover, the
routine combn(n,k) prints the combinations (1.31) by columns, but a routine for
the calculation of the permutations is not available. For this purpose our routines
Perm, Combn and Dispn are available to print permutations and combinations
by rows.

A particularly useful formula is the hypergeometric law, which allows the
calculation of the probability of extracting k marbles of type A when
n ≤ a + b marbles are drawn without replacement from an urn containing a marbles of type
A and b marbles of type B. Assuming that all marbles have the same probability of
being extracted and that extractions are independent, adopting the a priori definition
(1.2) and using the binomial coefficients, we have:

\[ P(k; a, b, n) = \frac{\dbinom{a}{k} \dbinom{b}{n-k}}{\dbinom{a+b}{n}}, \qquad \max(0, n-b) \le k \le \min(n, a). \tag{1.33} \]

In fact, the number of possible cases in the denominator is given by the binomial
coefficient, while in the numerator we have the number of favourable cases, given
by the number of elements of the Cartesian product of the two sets consisting of
C(a, k) and C(b, n − k) elements, respectively.

In R, the hypergeometric law probabilities are calculated by the routine
dhyper(k,a,b,n).


Exercise 1.1 

Find the probability, in a lottery, of a combination of two (pair) or three
(triplet) numbers out of five numbers between 1 and 90 drawn from an urn
(Italian lottery).

Answer The solution, if the game is not rigged, is given by the hypergeometric
law (1.33) with a = k and b = 90 − k:

\[ P(2; 2, 88, 5) = \frac{\dbinom{2}{2} \dbinom{88}{3}}{\dbinom{90}{5}} = \frac{2}{801} \simeq \frac{1}{400} \quad \text{(pair)}, \]

\[ P(3; 3, 87, 5) = \frac{\dbinom{3}{3} \dbinom{87}{2}}{\dbinom{90}{5}} = \frac{1}{11748} \quad \text{(triplet)}. \]

The same results are obtained by calling dhyper(2,2,88,5) and
dhyper(3,3,87,5). The pair probability is about 1 over 400 and that
of the triplet is about 1 over 12,000. A game is fair if the payout equals the
inverse of the probability of the bet; in the Italian lottery, the pair is paid 250
times and the triplet 4250 times ...
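As a sanity check, Eq. (1.33) can be coded directly with choose() and compared with dhyper(). The function name `hyper` below is a hypothetical helper introduced only for this sketch:

```r
# Hypergeometric law, Eq. (1.33), written with binomial coefficients.
hyper <- function(k, a, b, n) choose(a, k) * choose(b, n - k) / choose(a + b, n)

pair    <- hyper(2, 2, 88, 5)
triplet <- hyper(3, 3, 87, 5)

# Inverse probabilities, i.e. the payouts of a fair game.
fair_payout <- c(pair = 1 / pair, triplet = 1 / triplet)

# Agreement with the built-in routine quoted in the text.
same_as_dhyper <- c(isTRUE(all.equal(pair,    dhyper(2, 2, 88, 5))),
                    isTRUE(all.equal(triplet, dhyper(3, 3, 87, 5))))
```

Comparing `fair_payout` with the payouts quoted in the exercise (250 and 4250) shows how far the game is from being fair.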


1.7 Bayes’ Theorem 


In principle, any problem involving the use of probability can be solved with
the two fundamental laws of additivity and product. However, the algebra of
probability leads quickly to complicated formulae, even in the case of relatively
simple situations. In these cases two basic formulae are of great help, those of total
probability and Bayes’ theorem, as we will show. If the sets $B_i$ (i = 1, 2, ..., n)
are pairwise disjoint and collectively exhaustive:

\[ \bigcup_{i=1}^{n} B_i = S, \qquad B_i \cap B_k = \emptyset \quad \forall\, i \ne k, \tag{1.34} \]

by means of Eq. (1.21), it is easy to show that, for every set A in S:

\[ \begin{aligned}
P(A) &= P[A \cap (B_1 \cup B_2 \cup \cdots \cup B_n)] \\
     &= P[(A \cap B_1) \cup (A \cap B_2) \cup \cdots \cup (A \cap B_n)] \\
     &= P(A|B_1) P(B_1) + P(A|B_2) P(B_2) + \cdots + P(A|B_n) P(B_n) \\
     &= \sum_{i=1}^{n} P(A|B_i) P(B_i).
\end{aligned} \tag{1.35} \]

Equation (1.35) is called partition theorem or law of total probability. When $B_1 = B$
and $B_2 = \bar{B}$, the theorem gives:

\[ P(A) = P(A|B) P(B) + P(A|\bar{B}) P(\bar{B}). \tag{1.36} \]


With these formulae, you can solve problems that happen frequently, such as those 
shown in the following two examples. 




Exercise 1.2 

A disease H affects 10% of men and 5% of women per year. Knowing that
the population is composed of 45% men and 55% women, find the expected
number N of sick persons in a population of 10,000 people.

Answer The probability of getting sick for each man or woman of the
population is given by the probability that the individual is a woman times the
probability that a woman gets sick plus the probability that the individual
is a man times the probability that a man gets sick. This situation is
summarized in Eqs. (1.35, 1.36). Therefore, we have:


P(H) = 0.45 - 0.10 + 0.55 - 0.05 = 0.0725 . 


The expected number of sick persons is obtained by multiplication of the 
number of trials (individuals) times the probability P(H) we have just found. 
We then have: 


N = 10,000 - 0.0725 = 725 . 


Exercise 1.3 
A box contains six white and four black marbles. After two extractions 
without replacement, what is the probability to get a white marble at the 
second draw? 


Answer By indicating with A and B the outcome of a white marble at the first
and second extraction, respectively, from Eq. (1.36) one obtains, with obvious
meaning of symbols:

\[ P(B) = P(B|A) P(A) + P(B|\bar{A}) P(\bar{A}) = \frac{5}{9} \cdot \frac{6}{10} + \frac{6}{9} \cdot \frac{4}{10} = 0.60. \]
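Exercises 1.2 and 1.3 are both instances of Eq. (1.35), so one small function covers them. The following is a minimal sketch; `total_prob` is a hypothetical helper name, not a routine of this book:

```r
# Law of total probability, Eq. (1.35): sum_i P(A|B_i) P(B_i).
total_prob <- function(cond, prior) {
  # The B_i must form a partition, so their probabilities sum to 1.
  stopifnot(length(cond) == length(prior), isTRUE(all.equal(sum(prior), 1)))
  sum(cond * prior)
}

# Exercise 1.2: disease incidence for men and women.
p_H    <- total_prob(cond = c(0.10, 0.05), prior = c(0.45, 0.55))
n_sick <- 10000 * p_H

# Exercise 1.3: white marble at the second draw (6 white, 4 black, no replacement).
p_B <- total_prob(cond = c(5/9, 6/9), prior = c(6/10, 4/10))

c(p_H = p_H, n_sick = n_sick, p_B = p_B)
```

Note that in Exercise 1.3 the result equals P(A) = 6/10: not knowing the outcome of the first draw, the second draw has the same probability of giving a white marble as the first.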


If we now use Eq. (1.35) to express the probability P(A) that appears in (1.21), 
we get the famous Bayes’ theorem. 




Theorem 1.3 (Bayes) When the sets $B_k$ follow Eq. (1.34), the conditional
probability $P(B_k|A)$ can be written as:

\[ P(B_k|A) = \frac{P(A|B_k) P(B_k)}{\sum_{i=1}^{n} P(A|B_i) P(B_i)}, \qquad P(A) > 0. \tag{1.37} \]


This theorem is perhaps the most relevant result of the elementary algebra of
probability, because it allows us to reverse the conditional probabilities, avoiding
the errors resulting from the violation of Eq. (1.20). It is often used to “readjust”,
based on a real data set A, the probabilities $P(B_k)$ arbitrarily assigned a priori.
The procedure to be used is shown in the following examples; we also recommend
that physics students solve Problem 1.8 at the end of the chapter.


Exercise 1.4 

A test for the diagnosis of a disease is 100% sensitive for sick people but is 
also positive in 5% of the healthy people. Knowing that the illness is present 
on average in 1% of the population, what is the probability of being really 
sick if your test is positive? 


Answer Since diagnostic testing is an important medical problem, let’s
deal with the topic in a general way. We can define the following conditional
probabilities:

P(P|H) = 0.05   False Positive (FP): probability to be positive when healthy,
P(N|H) = 0.95   True Negative (TN): probability to be negative when healthy,
P(P|S) = 1      True Positive (TP): probability to be positive when sick,
P(N|S) = 0      False Negative (FN): probability to be negative when sick.

P(P|S) and P(N|H) are known as sensitivity and specificity, respectively.
From the probability laws one obviously has:

\[ P(P|H) + P(N|H) = 1, \]
\[ P(P|S) + P(N|S) = 1. \]

A test is ideal when the following conditions hold:

\[ P(P|H) = 0, \quad P(N|H) = 1, \]
\[ P(P|S) = 1, \quad P(N|S) = 0. \]


Now we have to find the probability P(S|P) of being sick conditioned by the
positivity of the test. Applying Bayes’ theorem (1.37) and bearing in mind
that from the data we know that the probabilities to be healthy or sick are,
respectively, P(H) = 0.99 and P(S) = 0.01, we obtain:

\[ P(S|P) = \frac{P(P|S) P(S)}{P(P|S) P(S) + P(P|H) P(H)} = \frac{1 \times 0.01}{1 \times 0.01 + 0.05 \times 0.99} = 0.168, \]


that is, a probability of about 17%.

The result (a low probability with the positive test) seems paradoxical at first
sight. To help your intuition, we invite you to examine Fig. 1.6, which shows
the graphical representation of Bayes’ theorem. If 100 people are subjected to
the test, on average 99 will be healthy and only 1 will be sick; the test, applied
to the 99 healthy, will fail in 5% of cases, corresponding to 0.05 × 99 =
4.95 ≈ 5 positive cases; to these the correctly diagnosed case of disease must
be added. Eventually, we will have only one really sick person out of a total of six
positive tests:

\[ \frac{1}{6} = 16.67\%, \]

where the small difference with the exact calculation is due only to rounding
effects.

The test is then repeated for the positive persons. If the result is negative, then
the person is healthy, because the test considered here can never go wrong
on sick people. If the test result is still positive, then it is necessary to
calculate, based on Eq. (1.24), the probability of a doubly positive test on a
healthy person:

\[ P(PP|H) = P(P|H)\, P(P|H) = (0.05)^2 = 0.0025, \]

which is about 2.5 per thousand, and that of a doubly positive test on a sick person
(which obviously gives again P(PP|S) = 1), and to reapply Bayes’ theorem:

\[ P(S|PP) = \frac{P(PP|S) P(S)}{P(PP|S) P(S) + P(PP|H) P(H)} = \frac{1 \times 0.01}{1 \times 0.01 + 0.0025 \times 0.99} = 0.802. \]

The same result is obtained if one applies the initial probabilities P(P|S) and
P(P|H) to people who have already undergone a test, for which P(S) =
0.168 and P(H) = 0.832.

As you can see, not even two positive tests are enough for the certainty of the 
disease. You can calculate by yourself that, in these conditions, the certainty 
comes only after three consecutive tests (about 99%). 

The example shows how careful you need to be with testing which may 
result positive even on healthy people. The opposite is true with the tests that 
are always negative on the healthy persons but not always positive on the 
sick ones. In this case a positive test assures the disease, whereas a negative 
test leaves some uncertainty. There are also cases where the tests have an 
efficiency limited to both the healthy and sick persons. In all these situations, 
Bayes’ theorem allows you to exactly calculate the probabilities of interest. 
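The sequence of repeated positive tests can be reproduced with a few lines of R. This is only an illustrative sketch, with a hypothetical helper `bayes()` implementing Eq. (1.37) and each posterior reused as the prior for the next test:

```r
# Bayes' theorem, Eq. (1.37): posterior proportional to likelihood x prior.
bayes <- function(lik, prior) lik * prior / sum(lik * prior)

prior   <- c(sick = 0.01, healthy = 0.99)
lik_pos <- c(sick = 1.00, healthy = 0.05)   # P(P|S) and P(P|H)

post   <- prior
p_sick <- numeric(3)
for (i in 1:3) {
  post      <- bayes(lik_pos, post)   # update after one more positive test
  p_sick[i] <- post["sick"]
}
round(p_sick, 3)   # probability of being sick after 1, 2 and 3 positive tests
```

The first two values reproduce the 0.168 and 0.802 found in the exercise, and the third is the value of about 99% mentioned for three consecutive positive tests.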


[Figure: tree diagram. Of 100 people, 99 are healthy (94 negative, 5 positive) and 1 is sick (positive); sick/positive = 1/6 ≈ 17%]

Fig. 1.6 Graphical illustration of Bayes’ theorem for a test which gives 5% of false positives for
a disease affecting 1% of the population




Exercise 1.5 

A group of symptoms $A_1$, $A_2$, $A_3$, $A_4$ can be due to three diseases $H_1$, $H_2$,
$H_3$, which, based on epidemiological data, have a relative frequency of 10%,
30% and 60%, respectively. The relative probabilities are therefore:

\[ P(H_1) = 0.1, \quad P(H_2) = 0.3, \quad P(H_3) = 0.6. \tag{1.38} \]

According to epidemiological data, the probabilities of occurrence of the symptoms above in
the three diseases are as follows:

        A1     A2     A3     A4
H1      0.9    0.8    0.2    0.5
H2      0.7    0.5    0.9    0.99
H3      0.9    0.9    0.4    0.7

from which it results, for example, that the symptom $A_2$ occurs in 80% of
cases in the $H_1$ disease, the symptom $A_4$ occurs in 70% of cases in the $H_3$
disease and so on.

A patient presents only $A_1$ and $A_2$ symptoms. Which of the three considered
diseases is the most likely?


Answer First of all, to apply Bayes’ theorem, it is necessary to define the
patient as an event A such that:

\[ A = A_1 \cap A_2 \cap \bar{A}_3 \cap \bar{A}_4, \]

and to calculate the probabilities of this event, conditional on the three
diseases (hypotheses) $H_1$, $H_2$, $H_3$:

\[ P(A|H_i) = P(A_1|H_i)\, P(A_2|H_i)\, P(\bar{A}_3|H_i)\, P(\bar{A}_4|H_i) \quad (i = 1, 2, 3). \]

From the table, we also obtain:

\[ P(A|H_1) = 0.9 \times 0.8 \times 0.8 \times 0.5 = 0.288, \]
\[ P(A|H_2) = 0.7 \times 0.5 \times 0.1 \times 0.01 = 0.00035, \]
\[ P(A|H_3) = 0.9 \times 0.9 \times 0.6 \times 0.3 = 0.1458. \]

The most likely disease seems to be $H_1$, but we have not yet taken into account
the epidemiological frequencies (1.38); to deal with this crucial point, it is
necessary to use Bayes’ theorem!

We therefore apply Eq. (1.37) and finally get the probabilities for each of the
three diseases (note that the sum gives 1, thanks to the denominator of Bayes’
formula, which is just the normalization factor):

\[ P(H_1|A) = \frac{0.288 \times 0.1}{0.288 \times 0.1 + 0.00035 \times 0.3 + 0.1458 \times 0.6} = 0.2475, \]
\[ P(H_2|A) = \frac{0.00035 \times 0.3}{0.288 \times 0.1 + 0.00035 \times 0.3 + 0.1458 \times 0.6} = 0.0009, \]
\[ P(H_3|A) = \frac{0.1458 \times 0.6}{0.288 \times 0.1 + 0.00035 \times 0.3 + 0.1458 \times 0.6} = 0.7516. \]


The final result shows that H3 is the most likely disease, with a probability of 
about 75%. 

The solution to the problem can also be found graphically, as shown in 
Fig. 1.7: since there are small probabilities, in the figure we consider 100,000 
subjects, who are divided according to the three diseases weighted with the 
epidemiological frequencies 0.1, 0.3, 0.6; applying to these three groups the 
probabilities of the set of symptoms A (0.288, 0.00035, 0.1458), one gets 
the final numbers 2880, 10, 8748. Also in this way we obtain the results 
provided by Bayes’ formula. 
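The whole calculation of this exercise fits in a few lines of R. The sketch below assumes the symptom probabilities implied by the products quoted in the answer (for example 0.9, 0.8, 1 − 0.8, 1 − 0.5 for $H_1$) and treats the symptoms as independent given the disease, as the answer does; the object names are illustrative:

```r
# Rows: diseases; columns: P(symptom A1..A4 present | disease).
p_sym <- rbind(H1 = c(0.9, 0.8, 0.2, 0.50),
               H2 = c(0.7, 0.5, 0.9, 0.99),
               H3 = c(0.9, 0.9, 0.4, 0.70))
prior   <- c(H1 = 0.1, H2 = 0.3, H3 = 0.6)   # Eq. (1.38)
present <- c(TRUE, TRUE, FALSE, FALSE)       # patient shows A1 and A2 only

# Likelihood P(A|H_i): p for a present symptom, 1 - p for an absent one.
lik  <- apply(p_sym, 1, function(p) prod(ifelse(present, p, 1 - p)))

# Bayes' theorem, Eq. (1.37).
post <- lik * prior / sum(lik * prior)
round(rbind(likelihood = lik, posterior = post), 4)
```

The likelihood alone favours $H_1$, while the posterior, which includes the epidemiological frequencies, favours $H_3$.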


[Figure: tree diagram. 100 000 sick subjects are split into disease 1 (10 000), disease 2 (30 000) and disease 3 (60 000); those showing the set of symptoms A are 2 880, 10 and 8 748, giving 2 880/11 638 = 25%, 10/11 638 = 0.09% and 8 748/11 638 = 75%]

Fig. 1.7 Graphic illustration of Bayes’ theorem in the case of three diseases with some symptoms
in common


1.8 Learning Algorithms 


Bayes’ formula is the basis of many machine learning codes and artificial
intelligence algorithms, from spam mail recognition to the proper function of electric
appliances and to the learning of neural networks. The topic is very broad and we
just want to give you a general idea with a simple example. Suppose Bob is attracted
to Alice (the example also applies with the roles reversed) and that he wants to test if the
interest is mutual by inviting Alice to have a coffee. Having no information, Bob
assumes the following probabilities:

$P(OK) = 0.5$, attraction,
$P(\overline{OK}) = 1 - P(OK) = 0.5$, indifference,
$P(YES|OK) = 0.9$, positive answer with attraction,
$P(NO|OK) = 1 - P(YES|OK) = 0.1$, negative answer with attraction,
$P(YES|\overline{OK}) = 0.5$, positive answer and indifference,
$P(NO|\overline{OK}) = 1 - P(YES|\overline{OK}) = 0.5$, negative answer and indifference,

which give 50% probability to the possible existence of attraction by Alice. In the
case of Alice’s first affirmative answer, the probability of mutual attraction becomes,
by using the initial data:

\[ P(OK|YES) = \frac{P(YES|OK) P(OK)}{P(YES|OK) P(OK) + P(YES|\overline{OK}) P(\overline{OK})}
             = \frac{0.9 \cdot 0.5}{0.9 \cdot 0.5 + 0.5 \cdot 0.5} = 0.643. \tag{1.39} \]

Instead, in the case of a first negative answer, one has:

\[ P(OK|NO) = \frac{P(NO|OK) P(OK)}{P(NO|OK) P(OK) + P(NO|\overline{OK}) P(\overline{OK})}
            = \frac{0.1 \cdot 0.5}{0.1 \cdot 0.5 + 0.5 \cdot 0.5} = 0.167. \tag{1.40} \]


Now the crucial step for learning takes place: in evaluating the probability after
a second answer, the substitution $P(OK|YES) \to P(OK)$ is performed (and
consequently $P(\overline{OK}) = 1 - P(OK)$) in Eq. (1.39) in the case of a first affirmative
answer, or $P(OK|NO) \to P(OK)$ in Eq. (1.40) in the case of a first negative answer.
In this way, basic learning is achieved based on the data accumulation, so that the
probability P(OK), assumed initially to be 50% for lack of initial information, is
continuously updated and made more reliable. With our routine BayesBobA1, you
can interactively check how the probabilities evolve as a function of the answers.

It turns out, for example, that if there are three consecutive negative answers,
the chance of Alice’s attraction to Bob gradually assumes the decreasing values
0.167, 0.038, 0.008, confirming the advice given by a friend: “Bob, after three
refusals, it is better to give up...”
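The updating scheme can be sketched in a few lines of R. The function `update_ok()` below is a hypothetical illustration written for this example (it is not the book's BayesBobA1 routine); its default arguments are the probabilities assumed by Bob in the text:

```r
# One Bayesian update of P(OK) after a "YES" or "NO" answer, Eqs. (1.39)-(1.40).
update_ok <- function(p_ok, answer, p_yes_ok = 0.9, p_yes_not = 0.5) {
  lik_ok  <- if (answer == "YES") p_yes_ok  else 1 - p_yes_ok   # P(answer | OK)
  lik_not <- if (answer == "YES") p_yes_not else 1 - p_yes_not  # P(answer | not OK)
  lik_ok * p_ok / (lik_ok * p_ok + lik_not * (1 - p_ok))
}

answers <- c("NO", "NO", "NO")   # three consecutive refusals
p       <- 0.5                   # initial P(OK)
history <- numeric(0)
for (a in answers) {
  p       <- update_ok(p, a)     # the posterior becomes the new prior
  history <- c(history, p)
}
round(history, 3)
```

Changing the vector `answers` (for instance to a mix of "YES" and "NO") shows how each new answer moves the probability up or down starting from the value reached so far.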

In the previous example, we showed how learning algorithms extend the
application of Bayes’ formula also to the cases where, at the beginning, the probabilities are
not known with reasonable certainty. In these situations, Eq. (1.37) can be also used
to modify, during the data collection, the initial probabilities of hypotheses $P(H_i)$
subjectively evaluated according to statement 1.2.

Following the method we have outlined above, if we indicate with the generic
event “data” the result of one or more trials (experiments), in the Bayesian approach
one applies Eq. (1.37) as follows:

\[ P(H_k|\text{data}) = \frac{P(\text{data}|H_k) P(H_k)}{\sum_{i=1}^{n} P(\text{data}|H_i) P(H_i)}. \tag{1.41} \]

The probabilities $P(H_k|\text{data})$ thus obtained are then substituted for $P(H_k)$ in the
term to the right of Eq. (1.41) and the calculation can be repeated iteratively:

\[ P_n(H_k|E) = \frac{P(E_n|H_k)\, P_{n-1}(H_k)}{\sum_{i} P(E_n|H_i)\, P_{n-1}(H_i)}, \tag{1.42} \]

where $E_n$ is the n-th event. In the following example, the probabilities $P(E_n|H_k)$
remain constant.


Exercise 1.6 

An urn contains five black and white marbles in an unknown proportion.
Assuming the same probability $P(H_i) = 1/6$ for the six possible starting
hypotheses, written with obvious notation as:

$$H_i = (b, n): \quad H_1 = (5,0),\ H_2 = (4,1),\ H_3 = (3,2),\ H_4 = (2,3),\ H_5 = (1,4),\ H_6 = (0,5),$$

where $b$ and $n$ are the numbers of white and black marbles in the urn,
calculate the six probabilities $P(H_i|\text{data})$ when, with replacement into the
urn, n = 1, 5, 10 black marbles are extracted consecutively.


Answer The exercise is easily solved by defining the event “data” = $E = E_n$
as the extraction of a black marble, using, for the a priori probabilities
$P(E|H_k) = P(E_n|H_k)$, the values:


$$P(\text{black}|H_1) = 0, \quad P(\text{black}|H_2) = 1/5, \quad P(\text{black}|H_3) = 2/5,$$
$$P(\text{black}|H_4) = 3/5, \quad P(\text{black}|H_5) = 4/5, \quad P(\text{black}|H_6) = 1$$


and applying iteratively Eq. (1.42). One gets Table 1.2, from which we see
that, as the number of black marbles drawn consecutively increases,
hypothesis $H_6$ (5 black marbles) becomes more and more probable.
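
The iterative use of Eq. (1.42) in this exercise can be sketched in a few lines of R. This is only an illustrative sketch: the function names bayes_step and posterior_after are our own (hypothetical) choices, and the output should be compared with Table 1.2 allowing for small differences in the last printed digit.

```r
# Iterative Bayesian update for the urn of Exercise 1.6 (Eq. 1.42).
# Hypothesis H_i: the urn holds (i - 1) black marbles out of 5, i = 1, ..., 6.
likelihood <- (0:5) / 5          # P(black | H_i), constant at every draw
prior      <- rep(1 / 6, 6)      # equal a priori probabilities P(H_i)

# One application of Eq. (1.42): probabilities after a single black draw.
# (bayes_step is a hypothetical helper name, not a routine of the book.)
bayes_step <- function(p_prev, lik) {
  p_prev * lik / sum(p_prev * lik)
}

# Posterior after n consecutive black draws, obtained by n repeated updates.
posterior_after <- function(n, p0 = prior, lik = likelihood) {
  p <- p0
  for (k in seq_len(n)) p <- bayes_step(p, lik)
  p
}

post <- sapply(c(1, 5, 10), posterior_after)
dimnames(post) <- list(paste0("H", 1:6), paste0("n=", c(1, 5, 10)))
print(round(post, 3))
```

Each column of post is one row of a posteriori probabilities; starting from a different vector prior shows directly how the result depends on the initial hypotheses.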


It is important to note that this problem has not been solved with a pure
frequentist approach, because for the initial hypotheses the a priori probabilities
$P(H_i) = 1/6$ $(i = 1, 2, \dots, 6)$ were used, which are subjective and arbitrary.
However, this increased flexibility is paid for with a certain amount of ambiguity,
because with different initial hypotheses, different results would have been obtained,
as in Problems 1.12 and 1.13. The dilemma “greater flexibility of application in spite
of some ambiguity of the results” often gives rise to heated debates, as in [JLPe00].

The example just seen is therefore fundamental to understand the difference 
between the frequentist and the Bayesian approach: 


• Frequentist approach (followed in this book): no arbitrary subjective probabilities
are assumed for the hypotheses. Therefore, probabilities of hypotheses of the
type P(H|data) = P(hypothesis|data) are never determined. For a frequentist
solution of the exercise just seen, see Exercise 6.1.

• Bayesian approach: probabilities such as P(hypothesis|data) are determined. They
depend, via Eq. (1.41), on the initial probabilities arbitrarily assumed for the
hypotheses and on the data obtained during the trials.


Table 1.2 Calculation of the a posteriori probabilities $P_n(H_i|n\,●)$ starting from equal a priori
probabilities in the case of consecutive extractions with replacement of n black marbles from an
urn containing five black and white marbles

Hypothesis            H_1     H_2     H_3     H_4     H_5     H_6
Urn content           ○○○○○   ○○○○●   ○○○●●   ○○●●●   ○●●●●   ●●●●●
P(H_i) a priori       1/6     1/6     1/6     1/6     1/6     1/6
P_n(H_i|n = 1●)       0       0.07    0.13    0.20    0.27    0.33
P_n(H_i|n = 5●)       0       0.00    0.01    0.05    0.23    0.71
P_n(H_i|n = 10●)      0       0.00    0.00    0.00    0.11    0.89




1.9 Problems 


1.1 Monty Hall’s game is named after the host of a television game show that in 1990
sparked a lively debate about probability among Americans. The contestant is placed in
front of three doors: behind one door there is a car, and behind the others, there
are goats. The contestant picks a door, say no. 1, and Monty, who knows what is behind the
doors, opens another door, say no. 3, which has a goat. He then asks the contestant, “Do you
want to pick door number 2?” Is it better to switch, not to switch, or is the choice
indifferent?


1.2 In the game of bridge, a deck of 52 cards is divided into 4 groups of 13 cards 
and dealt to 4 players. Calculate the probability that 4 players who play 100 games 
a day for 15 billion years (the age of the universe) can repeat the same game. 


1.3 A device is made up of three elements, which can fail independently of each
other. The probabilities of operation of the three elements during a fixed time T are
$p_1 = 0.8$, $p_2 = 0.9$, $p_3 = 0.7$. The machine stops due to a fault in the first element
or for failure of the second and third elements. Calculate the probability P for the
device to work within T.


1.4 A device is made up of four elements all having the same probability p = 0.8
of operation within the time T. The device stops for a simultaneous failure of
elements 1 and 2 or for a simultaneous failure of elements 3 and 4. (a) Draw the
device operating flow and (b) calculate the probability P of working within T.


1.5 Calculate the probability of getting at least one 6 when rolling three dice.


1.6 A quality check of a batch containing ten pieces accepts the whole lot if all 
three pieces chosen at random are good. Calculate the probability P that the lot will 
be discarded in case of (a) one defective piece or (b) four defective pieces. 


1.7 The famous “encounter problem”: two friends X and Y decide to meet in a 
certain place at an hour between 12 and 13, randomly choosing the arrival time. X 
arrives, waits for 10 minutes, and then leaves. Y behaves like X but waits for 12 
minutes. What is the probability P that X and Y meet each other? 


1.8 The trigger problem, common in physics: a physical system randomly produces 
the events A and B with probability 90% and 10%, respectively. A device, designed 
to select the good B events, enables (triggers) the recording of the events A and 
B in 5% and 95% of cases, respectively. Calculate the percentage P(T) of events 
accepted by the trigger and the percentage P(B|T) of B type events among those 
accepted. 




1.9 Evaluate the probability P{X < Y} that in a test, in which two coordinates 
0 < X,Y < 1 are randomly extracted in a uniform way, one gets the values x < y. 


1.10 Three electronic firms, A, B and C, supply identical components to a 
laboratory. The supply percentages are 20% for A, 30% for B and 50% for C. The 
percentage of defective components of the three suppliers is 10% for A, 15% for B 
and 20% for C. What is the probability that a component chosen at random will turn 
out to be defective? 


1.11 A certain type of pillar has the breaking load R uniformly distributed between
150 and 170 kN. Knowing that it is subjected to a random load C uniformly distributed
between 140 and 155 kN, calculate the probability of the pillar failure.


1.12 Solve Exercise 1.6 in the case of consecutive extraction of n = 5 black
marbles, assuming the following initial probabilities (of binomial type): $P(H_1) =
0.034$, $P(H_2) = 0.156$, $P(H_3) = 0.310$, $P(H_4) = 0.310$, $P(H_5) =
0.156$, $P(H_6) = 0.034$.


1.13 If you assume that your friend is 50% honest and 50% cheating, find the final 
probability that the friend is a cheater after n = 5, 10, 15 consecutive wins. 


1.14 The probability of the three events A, B and C is different from zero. State 
whether the following statements are true or false: 

1) $P(ABC) = P(A|BC)P(B|C)P(C)$; 2) $P(AB) = P(A)P(B)$; 3) $P(A) =
P(AB) + P(A\bar{B})$; 4) $P(A|BC) = P(AB|C)P(B|C)$.


1.15 A randomly chosen thermometer from a sample marks 21° Celsius. From 
the production standards, you know that the probabilities that the thermometer 
shows the temperature decreased by one degree, the right one and that increased 
by one degree are 0.2, 0.6, 0.2, respectively. The subjective a priori probabilities 
about the temperature of the environment, according to a survey, are: P(19°) =
0.1, P(20°) = 0.4, P(21°) = 0.4, P(22°) = 0.1. Calculate the a posteriori
probabilities of the measured temperature.

(Hint: indicate the temperatures to be evaluated as P(true|measured) = 
P(true|21°)). 


1.16 The probability of honestly winning a lottery is estimated at one over a million
($10^{-6}$). Prove that the probability for a winner to be honest is not $10^{-6}$!


1.17 The likelihood of a DNA test making a wrong association is evaluated as one
case in 10,000. In a town of 20,000 inhabitants, in which it is certain that the
person responsible for a serious crime is present, all the inhabitants are tested for DNA.
What is the probability that a person who tests positive is guilty?


1.18 Find the probability of the event depicted on the book cover. 


Chapter 2
Representation of Random Phenomena


Science is predicated upon the belief that the Universe is 
algorithmically compressible and the modern search for a 
Theory of Everything is the ultimate expression of that belief, a 
belief that there is an abbreviated representation of the logic 
behind the Universe’s properties that can be written down in 
finite form by human beings. 


John D. Barrow, “THEORIES OF EVERYTHING: THE QUEST 
FOR ULTIMATE EXPLANATION”. 


2.1 Introduction 


We will begin this chapter by better defining the formalism, notation and terminology
that will accompany us throughout the rest of the book. Without this very
important step, the reader would risk misunderstanding the meaning of most of the
basic equations of probability and statistics.

We will continue by describing the representation of events in histograms, 
which is the most convenient for the correct development of both probabilistic and 
statistical theories and the one closest to applications. This choice will lead us to 
immediately define the first probability distribution, the binomial, while the other 
distributions will be studied later, in Chap. 3. 

From now on we begin the systematic use of the R software to explain all the
new concepts and topics that will be presented. We therefore recommend that the reader
install R on her/his computer before reading this chapter, and get some practice
through the online instructions and the good manuals that can be downloaded online.
In addition, Appendix B should also be read in parallel with this chapter.


© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
A. Rotondi et al., Probability, Statistics and Simulation, La Matematica per il 3+2
139, https://doi.org/10.1007/978-3-031-09429-3_2


2.2 Random Variables 


Given the Definitions 1.5, 1.6 and 1.7 from the previous chapter, we consider a
probability space $\mathcal{E} = (S, \mathcal{F}, P)$: let $a \in A \subseteq S$ be the result of an experiment
realizing the event $A$ of the σ-algebra $\mathcal{F}$. We associate to each element $a$ a real
number using the function:


$$X : S \to (-\infty, +\infty), \quad \text{that is,} \quad X(a) = x. \tag{2.1}$$


Then we have the following definition. 


Definition 2.1 (Random Variable) A random variable $X(a)$ is a function having
the space $S$ as domain, the real axis as codomain and such that the set of elements $a$
for which the relation

$$X(a) \le x \tag{2.2}$$

holds is an event for any $x \in \mathbb{R}$.


Be careful because this is an important conceptual step: the random variable is
defined as a correspondence or function leading from the sample space to the real
axis. Obviously, it would be more appropriate to speak about a random function
instead of a random variable, but this terminology is the standard one. If $a_1 \in A_1$
and $a_2 \in A_2$ are elements of two sets (events) of $\mathcal{F}$, based on Eq. (1.6), the element
$a \in (A_2 - A_1)$ will also belong to a set (event) of the field. If now $X(a_1) \le x_1$ and
$X(a_2) \le x_2$, $(A_2 - A_1)$ will be the event corresponding to the numerical set:

$$x_1 < X(a) \le x_2 \quad \text{that is} \quad x_1 < X \le x_2. \tag{2.3}$$


Since a random variable establishes a correspondence between events and real
numbers, from Definition 2.1 and from Eq. (1.5), it follows that also the set:

$$\bigcap_{n=1}^{\infty} \left\{ a : x_0 - \frac{1}{n} < X(a) \le x_0 \right\} = \{a : X(a) = x_0\} = \{a : X = x_0\}, \tag{2.4}$$

where $x_0$ is a real number, is an event. Let us take, as an example, the experiment
consisting in the extraction of a numbered marble in the lottery. According to the
formalism just introduced, Eq. (2.2) becomes:

EXTRACT AND READ (marble) ≤ integer number .

The domain of this law is the set of marbles, the codomain is a subset of the real
axis and the random variable is defined as “random extraction of a marble and
readout of the number”. In general, if $R_0$ is any subset of the real axis, representable as




union (sum), intersection (product) or difference of intervals, it is easy to show that
also the numerical set:

$$\{a : X(a) \in R_0\}$$

can be referred to a σ-algebra $\mathcal{F}$ and defines an event.
To summarize, in this book we will use the notation:

$$\{X \ge x_0\}, \quad \{X \le x_0\}, \quad \{x_1 \le X \le x_2\}, \quad \{X \in R_0\}, \tag{2.5}$$


to indicate a set $A \in \mathcal{F}$ of elements $a$ obtained experimentally (not a set of real
numbers!) for which the function $X(a)$ satisfies the numerical condition within
braces. The probability:

$$P(A) = P\{X(a) \in R_0\} \tag{2.6}$$

is named distribution of X. It is a mapping $R_0 \mapsto P\{X(a) \in R_0\}$ which
associates to any subset $R_0 \subseteq \mathbb{R}$ the probability that X assumes values in $R_0$.
The distribution is a function defined on a set. It can be put in correspondence with
linear combinations of sums or integrals of functions, which are easier to handle.
These are the cumulative and density functions, which we will define shortly. We
consider probabilities for which the condition:

$$P\{X = \pm\infty\} = 0$$

holds. Note that this property does not exclude the variable from assuming
infinity as a value; however, the probability of these events is zero. In other books
you can find the notation:

$$x(a) \le x$$

or other equivalent forms instead of Eq. (2.2).

We will use uppercase letters (usually the last ones of the alphabet) to indicate the random
variables, and we will reserve the lowercase letters to represent the occurrences
(sometimes called random variates), that is, particular numerical values obtained in
an experiment; when we have conflicts of notation, we use uppercase bold letters
such as X for random variables.

A random variate is sometimes called a deviate when it is expressed as a difference from the
mean or from another central parameter (often divided by the standard deviation of the
distribution). Moreover, to understand what we indicate, you will have to pay
attention to the brackets: the expression {...} will always represent a set of the
sample space, corresponding to the set of real values indicated inside the curly
brackets.




Table 2.1 Comparison between the present notation and other current notations

Meaning                        Book notation          Other notations
Result of a trial              a                      a
The result is an event         a ∈ A                  a ∈ A
Random variable                X(a) = x               x(a) = x
If a ∈ A, X ∈ R_0              {X ∈ R_0}              {x ∈ R_0}
Probability P(A)               P{X ∈ R_0}             P{x ∈ R_0}
Real codomain of X(a)          Spectrum or support    Codomain, support
Operator on random variable    O[X]                   O[x], O(X)
Expected value                 E[X], ⟨X⟩              E(x), E(X)
Variance                       Var[X]                 Var[x], σ²(X)


The notations we use are given in Table 2.1. We stress the difference between
capital and lowercase notation:


• X is the random variable that can take on a finite or infinite set of possible
numerical values, before one or more trials.

• x is a number, that is, a specific numerical value of X after a specific trial. The
n-tuple of values $(x_1, x_2, \dots, x_n)$ is the result of a sampling.


It is important to note that the random variable must be defined on all elements of
the sample space S. As an example, consider a space $(S, \mathcal{F}, P)$ and an event A with
probability p, so that P(A) = p. It is natural then to define the function X, called
dummy function or variable, such that:

$$X(a) = 1 \ \text{if}\ a \in A, \qquad X(a) = 0 \ \text{if}\ a \in \bar{A}. \tag{2.7}$$

It is easy to see that X is a random variable:

• If $x < 0$, then to $X \le x$ corresponds the empty set ∅.
• If $0 \le x < 1$, then to $X \le x$ corresponds the set $\bar{A}$.
• If $x \ge 1$, then to $X \le x$ corresponds the sample space $A \cup \bar{A} = S$.

From the properties of the σ-algebra $\mathcal{F}$, if A is an event also the sets ∅, $\bar{A}$ and S are
events; hence X satisfies Definition 2.1 and is therefore a random variable. Several
random variables can be defined on the same probability space. For example, if
the sample space consists of a set of persons, the random variable X(a) can be the
weight of a person and the variable Y(a) his or her height. We will also often have to deal
with functions of one, two or more random variables, such as:


Z = f(X,Y). (2.8) 


The Z domain is formed by the elements of the sample space S, since by definition
X = X(a) and Y = Y(b) with $a, b \in S$, so that also Z = f(X(a), Y(b)) holds.
Since X(a) = x and Y(b) = y, it is possible to associate to Z also a “traditional”
function with real domain and codomain:

$$z = f(x, y).$$


So far, the discussion shows that the random variable creates a correspondence
between countable unions, intersections, differences and complements of sets of
S and intervals of real numbers. It is then possible to prove (but it is quite
intuitive) that, for every real z, if the inequality $f(x, y) \le z$ is satisfied by unions,
countable intersections or differences of real intervals of the variables x and y,
then the variable Z of Eq. (2.8) is also a random variable obeying Definition 2.1
[Cra51, PUP02]. If X and Y are random variables defined on the same probability
space, the same happens for all the functions f(X, Y, ...) that will be considered
later, so we will always assume, from now on, that the functions of random variables
are also random variables.

Following our notation, the difference between Z = f(X, Y) and z = f(x, y)
can be explained with the following example. Consider the sum:

$$Z = X + Y$$

and the experimental results:

$$z_1 = x_1 + y_1 \quad \text{and} \quad z_2 = x_2 + y_2.$$

The random variable Z indicates a possible set of results $(z_1, z_2, \dots)$, which are the
outcomes of Z in an experiment consisting of a series of trials where X and Y occur
and results are added together. Instead, $z_1$ represents the value of Z obtained in the
first test, $z_2$ the one obtained in the second test, and similarly for $x_i$ and $y_i$.
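
The distinction between the random variable Z and its realizations $z_i$ can be illustrated with a short R sketch. As an assumption made only for this illustration, X and Y are taken to be the results of two fair dice; the seed value is arbitrary.

```r
# Random variables X and Y: results of two fair dice (an illustrative assumption).
# Each call to sample() performs the trials and returns realizations (lowercase).
set.seed(1)                                    # arbitrary seed, for reproducibility
n_trials <- 5
x <- sample(1:6, n_trials, replace = TRUE)     # realizations x_1, ..., x_5 of X
y <- sample(1:6, n_trials, replace = TRUE)     # realizations y_1, ..., y_5 of Y
z <- x + y                                     # realizations z_i = x_i + y_i of Z = X + Y
print(rbind(x, y, z))
```

The rule "add the two results" plays the role of Z = f(X, Y), while each column of the printed matrix holds the numbers $x_i$, $y_i$ and $z_i$ obtained in one trial.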

The independence between events also defines the one between random vari- 
ables: 


Definition 2.2 (Independent Random Variables) If the random variables
$X_i$ $(i = 1, 2, \dots, n)$ are defined on the same probability space and the events
$\{X_i \in A_i\}$ $(i = 1, 2, \dots, n)$ are independent, according to Definition 1.10, for any
possible choice of the intervals $A_i \subseteq \mathbb{R}$, from Eq. (1.22), it results:

$$P\{X_1 \in A_1, X_2 \in A_2, \dots, X_n \in A_n\} = \prod_{i} P\{X_i \in A_i\}. \tag{2.9}$$

In this case one says that variables $X_i$ are stochastically independent.


In the following, we will simply state that variables are independent, but, when 
necessary, we will distinguish independence from a specific type of mathematical 
dependence, such as linear dependence (or independence), which implies the 
existence (or not) of linear equations between variables. 
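
Equation (2.9) can be checked on a small sample space where all probabilities are computed exactly. The sketch below assumes, purely as an illustration, the 36 equiprobable outcomes of two fair dice; the helper prob and the chosen intervals are our own. It tests only one choice of intervals, whereas Definition 2.2 requires the factorization for every choice.

```r
# All 36 equally likely outcomes (a, b) of two fair dice (illustrative sample space).
space <- expand.grid(a = 1:6, b = 1:6)
X <- space$a                     # X(a, b) = result of the first die
Y <- space$b                     # Y(a, b) = result of the second die
S <- X + Y                       # a variable built from X, hence not independent of it

# Probability of an event (logical vector over the outcomes), all equiprobable.
prob <- function(event) mean(event)

# X and Y: the joint probability factorizes, as in Eq. (2.9).
p_joint <- prob(X <= 2 & Y >= 5)
p_prod  <- prob(X <= 2) * prob(Y >= 5)

# X and S = X + Y: the factorization fails for this choice of intervals.
q_joint <- prob(X <= 2 & S >= 10)
q_prod  <- prob(X <= 2) * prob(S >= 10)

print(c(p_joint = p_joint, p_prod = p_prod, q_joint = q_joint, q_prod = q_prod))
```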

Finally, one last definition. 




Definition 2.3 (Spectrum or Support) The spectrum or support is the real 
codomain of the random variable X(a). The spectrum is called discrete when 
the codomain is a countable set and is called continuous when the possible values 
are in R or in subsets of R. 


A variable X with discrete spectrum is named discrete random variable, while
a variable with a continuous spectrum is named continuous random variable.
In mathematics the name spectrum is used in other contexts, as in the theory of
transforms. For this reason, the term support is generally used in mathematical
statistics. However, among physicists and engineers, during research or laboratory
activities, it is quite usual to speak about “the spectrum” rather than about the range
or support of the random variable being examined. There are also fields of physics,
such as atomic or nuclear spectroscopy, where the probabilities of discrete or continuous
energy states assumed by a physical system are studied. However, since in the
following the use of the term spectrum will not cause any conflict, we decided to
keep this term alongside the term support. The law or distribution (2.6) therefore
associates a probability to values (or sets of values) of the spectrum.


2.3 Cumulative or Distribution Function 


The law or distribution of a random variable is usually expressed in terms of the 
cumulative or distribution function, which gives the probability that the random 
variable assumes values less than or equal to a certain assigned value x. 


Definition 2.4 (Cumulative or Distribution Function) If X is a continuous or
discrete random variable, the cumulative or distribution function:

$$F(x) = P\{X \le x\} \tag{2.10}$$

represents the probability that X assumes a value not greater than an assigned value
x. Cumulative functions will usually be indicated by uppercase letters.


Notice that x does not have to be part of the spectrum of X; for example, in the case
of a die roll, where X = 1, 2, 3, 4, 5, 6:

$$F(3.4) = P\{X \le 3.4\} = P\{X \le 3\} = F(3).$$


If $x_1 < x_2$, the events $\{X \le x_1\}$ and $\{x_1 < X \le x_2\}$ are incompatible and the
total probability of the event $\{X \le x_2\}$ is given by Eq. (1.10):

$$P\{X \le x_2\} = P\{X \le x_1\} + P\{x_1 < X \le x_2\},$$

from which, according to Eq. (2.10):

$$P\{x_1 < X \le x_2\} = F(x_2) - F(x_1). \tag{2.11}$$

Since the probability is non-negative, we have:

$$F(x_2) \ge F(x_1).$$


If $x_{\max}$ and $x_{\min}$ are the maximal and minimal X values, respectively, from
Eq. (1.13) and from Definition (1.9), it follows:

$$F(x) = 0 \quad \text{for } x < x_{\min}, \tag{2.12}$$

$$F(x) = \sum_{i=1}^{\infty} p(x_i) = 1 \quad \text{for } x \ge x_{\max}. \tag{2.13}$$


It also turns out, by construction, that F(x) is continuous at each point x if
approached from the right. This fact depends on the position of the equal sign
appearing in Eq. (2.11):

$$\lim_{x_1 \to x_2} [F(x_2) - F(x_1)] = P\{X = x_2\}, \tag{2.14}$$

$$\lim_{x_2 \to x_1} [F(x_2) - F(x_1)] = 0 \quad \text{(continuity to the right)}. \tag{2.15}$$

It can be shown that any cumulative or distribution function fulfils the limits:

$$\lim_{x \to -\infty} F(x) = 0, \qquad \lim_{x \to +\infty} F(x) = 1, \tag{2.16}$$

and is non-decreasing and continuous to the right. Conversely, if a function satisfies
these properties, it represents the cumulative function of a random variable. For
mathematical details on the proof, see [Cra51].

We also note that from (2.10) it is easy to calculate the probability of obtaining 
values greater than an assigned limit x: 


P{X >x}=1-F(x). (2.17) 
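
The properties of F(x) discussed in this section can be verified numerically for the die roll used above. The following is a minimal sketch; the function name F_die is a hypothetical choice made here for illustration.

```r
# Cumulative function F(x) = P{X <= x} of a fair die with spectrum 1, ..., 6.
spectrum <- 1:6
p        <- rep(1 / 6, 6)

# F_die (hypothetical name) accepts any real x, not only spectrum values.
F_die <- function(x) sapply(x, function(t) sum(p[spectrum <= t]))

print(F_die(c(0.5, 3, 3.4, 6, 7.2)))   # F is 0 below x_min, 1 from x_max onwards

p_interval <- F_die(5) - F_die(2)      # Eq. (2.11): P{2 < X <= 5}
p_greater  <- 1 - F_die(4)             # Eq. (2.17): P{X > 4}
print(c(p_interval, p_greater))
```

Evaluating F_die at 3 and at 3.4 gives the same value, as stated after Definition 2.4.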


The cumulative F(x) allows the definition of a very useful quantity: 


Definition 2.5 (Quantile) The α quantile is the smallest value $x_\alpha$ that satisfies the
inequality:

$$P\{X \le x_\alpha\} = F(x_\alpha) \ge \alpha. \tag{2.18}$$

The inequality ≥ in Eq. (2.18) takes into account that, for discrete variables, $F(x_\alpha)$
may not coincide with α when this value is assigned a priori. For continuous
variables, $F(x_\alpha) = \alpha$ and the quantile is the value $x_\alpha = F^{-1}(\alpha)$.


For example, if $P\{X \le x_\alpha\} = F(x_\alpha) = 0.25$, it means that $x_{0.25}$ is the 0.25
quantile; if the α values are given as percentages, one says that $x_\alpha$ is the 25-th
percentile or that it lies between the second and third decile, or between 20% and
30%. Table B.2 of Appendix E indicates how to use R to get the quantiles of the main
probability distributions. In R, the quantile routine estimates quantile values by
interpolating from a set of random data. For example, if we generate a vector of
10 uniform random variates in [0, 1] with the instruction x<-runif(10), we can
estimate the 20% and 40% quantiles with the call:

quantile(x,c(0.20,0.40),names=FALSE)

where the argument names=FALSE suppresses the labels of the output and produces a plain
numerical vector containing the quantiles. In this case, we will see that the two output values,
interpolated between the data of the vector sorted in ascending order, have 2 and 4
values to the left, respectively.
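
Definition 2.5 and the quantile call above can be tried out with the following sketch. The discrete example (number of heads in 10 fair coin tosses) and the seed are assumptions chosen for illustration; pbinom, qbinom, qunif and punif are standard R routines.

```r
# Discrete case: number of heads in 10 fair coin tosses (illustrative choice).
alpha  <- 0.25
x_vals <- 0:10
F_vals <- pbinom(x_vals, size = 10, prob = 0.5)    # cumulative function F(x)

# Smallest spectrum value with F(x) >= alpha, straight from Eq. (2.18).
x_alpha <- min(x_vals[F_vals >= alpha])

# qbinom() applies the same "smallest x with F(x) >= alpha" rule.
print(c(x_alpha, qbinom(alpha, size = 10, prob = 0.5)))

# Continuous case: the quantile is the inverse of F, so F(x_alpha) = alpha.
q_unif <- qunif(alpha, min = 0, max = 1)
print(punif(q_unif))

# Empirical quantiles estimated from 10 simulated uniform variates, as in the text.
set.seed(2)                                        # arbitrary seed
x <- runif(10)
q <- quantile(x, c(0.20, 0.40), names = FALSE)
print(c(sum(x < q[1]), sum(x < q[2])))             # data values to the left of each quantile
```

In the discrete case $F(x_\alpha)$ is larger than α, which is why Eq. (2.18) is written as an inequality.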


2.4 Data Representation 


The most used representation to analyse data samples coming from an experiment 
is called histogram. In the case of a discrete random variable, the histogram is built 
as follows: 


• The x axis represents the spectrum of X.
• On the y axis the number of times that each spectrum value appeared in the
sample is recorded.


Consider an experiment formed by 100 trials, each trial consisting of tossing 10 
coins. We define the event as the number X of heads in a toss; the spectrum of X is 
given by the 11 integers 0, 1, 2, ..., 10. Every value is labelled with the number of 
times that number of heads occurred in those 100 trials. 

If we report in abscissa the spectrum of the event (shown also in the first column 
of Table 2.2), and in ordinate the results obtained in a real experiment (shown in the 
second column of Table 2.2), we obtain the histogram of Fig. 2.1. The number of 
events having x; heads is denoted by n(x;), which is called the number of events 
or trials fallen into the i-th bin of the histogram. Obviously, one has: 


$$\sum_{i=1}^{C} n(x_i) = N, \tag{2.19}$$


where C is the number of bins of the discrete spectrum and N the total number
of events of the trials made in the experiments. One can also say that the
histogram represents a sample of N events. The histogram thus constructed has
the disadvantage of having the ordinates diverging with the number of trials, since
$n(x_i) \to \infty$ for $N \to \infty$. This drawback is corrected by representing the sample
as a normalized histogram where the number of events $n(x_i)$ is replaced by the
frequency:

$$f(x_i) = \frac{n(x_i)}{N}. \tag{2.20}$$


Table 2.2 Results of a real experiment made of 100 observations, each consisting
in the tossing of 10 coins. The second column reports the number of observations in
which the number of heads reported in the first column was obtained. The third
column reports the empirical frequencies obtained simply by dividing the values of the
second column by 100, the fourth column reports the cumulative frequencies

Spectrum             Number of    Frequency    Cumulative
(number of heads)    trials
 0                    0           0.00         0.00
 1                    0           0.00         0.00
 2                    5           0.05         0.05
 3                   13           0.13         0.18
 4                   12           0.12         0.30
 5                   25           0.25         0.55
 6                   24           0.24         0.79
 7                   14           0.14         0.93
 8                    6           0.06         0.99
 9                    1           0.01         1.00
10                    0           0.00         1.00


Fig. 2.1 Two ways to build a histogram of an experiment where x heads are counted after the toss
of 10 coins: the values $0 \le x \le 10$ of the spectrum are reported in abscissa; the number of events
n(x), for each value x of the spectrum, is reported on the ordinate, as a point with abscissa x (a)
or as a bar as wide as the distance between two spectrum values (b). These data, reported also in
Table 2.2, refer to an experiment of 100 trials


The frequencies of the example of Fig. 2.1 are shown in the third column of
Table 2.2. In this way, when $N \to \infty$, the contents of the bins remain finite and,
based on Eq. (1.3), $f(x_i)$ tends to the probability $p(x_i)$ to fall into the i-th bin of
the spectrum. Equation (2.19) then becomes:

$$\sum_{i=1}^{C} f(x_i) = 1, \tag{2.21}$$

which is the normalization condition.


Fig. 2.2 Histogram of the cumulative frequencies of the data shown in Fig. 2.1

The frequencies of the histogram can also be represented as cumulative frequencies,
usually denoted by capital letters: for the k-th bin they are obtained, as in
Fig. 2.2, by adding its content to those of all the bins to the left and dividing by the
total number of events:

$$F_k = \frac{\sum_{i=1}^{k} n(x_i)}{N} = \sum_{i=1}^{k} f(x_i). \tag{2.22}$$

The cumulative frequency $F_k = F(x_k)$ gives the fraction of sample values $x \le x_k$.
It is the “experimental” estimate of the cumulative function (2.10).
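
Equations (2.19) to (2.22) translate directly into vector operations in R. The sketch below uses the counts of Table 2.2; the variable names are our own illustrative choices.

```r
# Number of trials n(x_i) of Table 2.2 (100 trials of 10 coin tosses each).
heads  <- 0:10
counts <- c(0, 0, 5, 13, 12, 25, 24, 14, 6, 1, 0)

N     <- sum(counts)             # Eq. (2.19): total number of events
freq  <- counts / N              # Eq. (2.20): frequencies f(x_i)
cumfr <- cumsum(freq)            # Eq. (2.22): cumulative frequencies F_k

print(data.frame(heads, counts, freq, cumfr))
print(sum(freq))                 # Eq. (2.21): normalization condition
```

The columns freq and cumfr correspond to the third and fourth columns of Table 2.2.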

Now we show how it is possible to extend the representation by histograms also
to the case of continuous variables that can assume values over any range [a, b]
of the real number field R. Let x be the values of the continuous spectrum under
consideration, belonging to the interval [a, b]. We divide this interval into equal
parts $[x_1, x_2), [x_2, x_3), \dots, [x_{m-1}, x_m]$ of width $\Delta x$, and assign to each new interval
(channel or bin) a value given by the number of events having the value of the
spectrum contained in that interval. If we divide by the total number of events, we
obtain the analogue of Eq. (2.20):

$$f(\Delta x_k) = \frac{n(\Delta x_k)}{N}. \tag{2.23}$$

For $N \to \infty$, based on Eq. (1.3), $f(\Delta x_k) \to p(\Delta x_k)$, which is the probability
to obtain spectrum values in $[x_{k-1}, x_k]$, that is, in the k-th bin. The graphical
representation of continuous variables is typically that of the histogram of Fig. 2.1b,
which indicates that values are distributed within the whole bin; if you want to
use the representation of Fig. 2.1a, the abscissa of the point is the central value
of the bin. Histograms can be obtained using the R routine hist(x), which
plots the histogram of a vector x of raw data. The graphical style can be changed
with the options listed in the online R manual. Alternatively, you can use our
HistoBar(x,fre) routine which, in addition to the raw data, allows you also
to draw histograms of data collected in two vectors: x containing the bin values and
fre containing the frequencies n(x)/N or the number of events n(x). If fre and x
have the same size, x is interpreted as the average value of the bin or as the spectrum
of a discrete variable; if x has one position more than fre, it is interpreted as the
vector of the bin breakpoints of a continuous variable.
For example, Fig. 2.2 has been obtained with the following lines: 


fre <- c(0,0,0.05,0.18,0.30,0.55,0.79,0.93,0.99,1.) 
x <- c(0,1,2,3,4,5,6,7,8,9,10) 
HistoBar(x, fre, xex='x', yex='F(x)')


where the calling sequence is explained in the comments of HistoBar. 
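
For a continuous variable, the binning of Eq. (2.23) can also be sketched with the base R routine hist alone, without HistoBar. The uniform sample, the interval [0, 2], the bin width and the seed below are illustrative assumptions.

```r
# Simulated continuous data in [a, b] = [0, 2] (illustrative choice).
set.seed(3)                                   # arbitrary seed
N <- 1000
x <- runif(N, min = 0, max = 2)

# Bin edges x_1, ..., x_m of width 0.25; right = FALSE gives bins [x_k, x_{k+1}),
# with the last bin closed on the right (include.lowest = TRUE is the default).
breaks <- seq(0, 2, by = 0.25)
h <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)

counts <- h$counts                            # number of events in each bin
freq   <- counts / N                          # Eq. (2.23): frequencies of the bins
print(rbind(mids = h$mids, freq = freq))      # bin centres and frequencies
```

Setting plot = TRUE (the default) draws the histogram instead of only returning the bin contents.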


2.5 Discrete Random Variables 


As we saw in Definition 2.3, a discrete random variable X takes at most a countable
infinity of values $(x_1, x_2, \dots, x_n)$. We also know, from Eq. (2.4), that the sets
$\{X = x_i\}$ are events. We can then define the probabilistic analogue of frequency
histograms, as follows:


Definition 2.6 (Probability Density Function for Discrete Variables) Given a
discrete variable X, the function p(x), given by:

$$p(x_i) = P\{X = x_i\} \tag{2.24}$$

for the discrete values of X and p(x) = 0 outside, is called probability density
function (sometimes abbreviated as p.d.f.) or, more simply, density.


Physicists usually call a p.d.f. a distribution. Statisticians reserve the name distribution
for the cumulative function. In the following we will use mainly the term
cumulative function.

This density satisfies the important normalization condition:

$$\sum_{i=1}^{\infty} p(x_i) = \sum_{i=1}^{\infty} P\{X = x_i\} = P\left( \bigcup_{i=1}^{\infty} \{X = x_i\} \right) = P(S) = 1. \tag{2.25}$$




The knowledge of the density allows the calculation of laws or statistical distributions
as in Eq. (2.6):

$$P(A) = P\{X(a) \in R_0\} = \sum_{x_i \in R_0} P\{X = x_i\} = \sum_{x_i \in R_0} p(x_i). \tag{2.26}$$

The cumulative or distribution function (2.10) can then be expressed as:

$$P\{X \le x_k\} = F(x_k) = \sum_{i=1}^{k} p(x_i). \tag{2.27}$$


This function can be seen as the probabilistic correspondent of the cumulative
frequency histogram (2.22). An example of density and cumulative functions is
shown in Fig. 2.3. As we know, the cumulative function is defined for every x, not
just for discrete values assumed by the variable X. Hence, in the case of Fig. 2.3 and
Table 2.3 we have, for example:

$$F(6.4) = P\{X \le 6.4\} = P\{X = 0, 1, 2, 3, 4, 5 \text{ or } 6\} = 0.828.$$

F(x) shows jumps of heights $P\{X = x_k\}$ for discrete X values and remains constant
within $[x_k, x_{k+1})$. Continuity on the right is shown graphically in Fig. 2.3 with dots
in bold to the left of the constant values in the ordinate. The important Eq. (2.11)
can be written using the density function as follows:

$$P\{X = x_k\} = P\{x_{k-1} < X \le x_k\} = F(x_k) - F(x_{k-1}) = p(x_k), \tag{2.28}$$

which allows one to perform calculations in terms of density function or cumulative
function, whichever turns out to be easier.


Fig. 2.3 Bar representation of the binomial probability density p(x) = b(x; 10, 1/2)
and corresponding cumulative function F(x). This distribution is the population
model for the 10 coin experiment of Table 2.2. The data are also reported in
Table 2.3

There is a distribution, called binomial or Bernoulli, able to predict the results of 
experiments like those of Fig. 2.1 and Table 2.2. This is one of the most important 
results of probability theory. 
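
The relations between density and cumulative function, Eqs. (2.25), (2.27) and (2.28), can be checked numerically with the standard R routines dbinom and pbinom. The sketch uses the density b(x; 10, 1/2) of Fig. 2.3, which is derived in the next section; the variable names are our own.

```r
# Density and cumulative function of the binomial model b(x; 10, 1/2) of Fig. 2.3.
n <- 10
p <- 0.5
x <- 0:n
dens <- dbinom(x, size = n, prob = p)     # density p(x_i)
cum  <- pbinom(x, size = n, prob = p)     # cumulative function F(x_i)

print(sum(dens))                          # normalization, Eq. (2.25)
print(all.equal(cum, cumsum(dens)))       # Eq. (2.27): F(x_k) is the running sum of p(x_i)
print(all.equal(dens, diff(c(0, cum))))   # Eq. (2.28): p(x_k) = F(x_k) - F(x_{k-1})

# F(x) stays constant between spectrum values: any x in [3, 4) gives F(3).
print(pbinom(3.7, size = n, prob = p) == pbinom(3, size = n, prob = p))
```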


2.6 Binomial Distribution 


Consider an experiment consisting of n attempts and let p be the a priori probability 
to obtain the aimed event (success) in each attempt. We want to find the probability 
of the event consisting of x successes and n — x failures in the considered 
experiment. The problem therefore requires the determination of a probability 
function b(x;n, p) where (be careful!) n and p are assigned parameters and 
b(x;n, p) is the probability of the event consisting of x successes. 

Consider now a series of results consisting of x successes and n — x failures, 
denoted by the symbols X and O, respectively: 


XXOOOX...XO 
XOOXOO...OX 


According to the law of compound probabilities (1.24), if the events are indepen- 
dent, the probability of each configuration (row) is the same and is given by the 
product of the probabilities of obtaining x successes and (n — x) failures, that is: 


$$p \cdot p \cdots (1-p) \cdot (1-p) \cdot (1-p) = p^x (1-p)^{n-x} .$$


The possible arrangements are as many as the combinations of n elements, of which
x are equal to each other and the remaining n − x are also equal to each other. We know
from combinatorial analysis that this number is given by the binomial coefficient (1.31):


$$\frac{n!}{x!\,(n-x)!} \equiv \binom{n}{x} .$$


Since an attempt realizing the requested event gives any one of the previous lines, 
according to the law (1.17), the probabilities of the rows must all be added up to 




obtain the final probability of the event. We therefore have the final expression of 
the binomial density function: 
$$P\{X = x \text{ favorable outcomes}\} = b(x; n, p) = \frac{n!}{x!\,(n-x)!}\, p^x (1-p)^{n-x} = \binom{n}{x} p^x (1-p)^{n-x} . \tag{2.29}$$


It is important to always remember that this distribution is valid if and only if the
n attempts are independent and the probability of success in an attempt is always
constant and equal to p. The binomial distribution, when n = 1, is also called the
Bernoulli distribution, in memory of the Swiss mathematician Jacques Bernoulli,
who first introduced it at the end of the seventeenth century. It has numerous
applications, as the following examples show.


Exercise 2.1 
Calculate the probability to obtain five successes in 10 attempts having each 
a 20% success probability. 


Answer From Eq. (2.29) one immediately has: 


$$b(5; 10, 0.2) = \frac{10!}{5!\,5!}\, (0.2)^5 (0.8)^5 = 0.0264 \simeq 2.6\% .$$


Repeating this calculation for 0 ≤ x ≤ 10 and plotting the corresponding probabilities,
the representation of the binomial distribution of Fig. 2.4 is obtained. The
binomial probabilities can also be obtained with the R routine dbinom(x,n,p)
(see also Table B.2). For example, to obtain Fig. 2.4, one can write:

x <- 0:10                          # possible numbers of successes
y <- dbinom(x, 10, 0.2)            # binomial probabilities b(x; 10, 0.2)
plot(x, y, type = 'p', pch = '+')  # points are drawn as small +
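
Equation (2.29) can also be coded directly and compared with dbinom. This is only a sketch: binom_density is a hypothetical helper introduced here for illustration, not one of the routines accompanying the text.

```r
# Eq. (2.29) written out explicitly with the binomial coefficient choose(n, x)
binom_density <- function(x, n, p) {
  choose(n, x) * p^x * (1 - p)^(n - x)
}

x <- 0:10
b_manual  <- binom_density(x, 10, 0.2)   # direct use of Eq. (2.29)
b_builtin <- dbinom(x, 10, 0.2)          # built-in R routine

print(all.equal(b_manual, b_builtin))    # the two results should agree
print(binom_density(5, 10, 0.2))         # the probability of Exercise 2.1
print(sum(b_manual))                     # normalization over the spectrum
```

For large n the explicit product may lose accuracy, so in practice dbinom is probably the safer choice.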


Exercise 2.2 
10% of the parts produced on an assembly line are defective. Calculate the 
probability to have 2 defective pieces over a total of 40. 


Answer Since n = 40, p = 0.1 and x = 2: 


$$b(2; 40, 0.1) = \binom{40}{2} (0.1)^2 (0.9)^{38} = 0.142 .$$


Fig. 2.4 Bar plot of the binomial distribution b(x; 10, 0.2)


Exercise 2.3 
Assuming that male and female births are equally probable, calculate the
probability of having five daughters out of five children.


Answer Since p = (1 — p) = 1/2,n = 5, x =5, one has: 


$$b(5; 5, 0.5) = \binom{5}{5} (0.5)^5 (0.5)^0 = 0.5^5 \simeq 0.031 .$$


Exercise 2.4 

The likelihood that a child will contract scarlet fever in preschool age is 35%. 
Calculate the probability that, in a classroom of 25 pupils, 10 pupils already 
had scarlet fever. 


Answer Since p = 0.35,n = 25, x = 10, one has: 


$$b(10; 25, 0.35) = \binom{25}{10} (0.35)^{10} (0.65)^{15} = 0.141 .$$
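
The answers to Exercises 2.2 to 2.4 can be cross-checked with dbinom. In this sketch the variable names ex2.2, ex2.3 and ex2.4 are ours, and the parameters are those stated in each exercise.

```r
# dbinom(x, n, p) returns b(x; n, p)
ex2.2 <- dbinom(2, 40, 0.1)     # 2 defective pieces out of 40, p = 0.1
ex2.3 <- dbinom(5, 5, 0.5)      # 5 daughters out of 5 children, p = 1/2
ex2.4 <- dbinom(10, 25, 0.35)   # 10 pupils out of 25, p = 0.35

print(round(c(ex2.2, ex2.3, ex2.4), 3))
```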


We can now compare Table 2.2 and Fig. 2.1 with the predictions of probability
theory. To do this, we just calculate the probabilities b(x; 10, 0.5) for 0 ≤ x ≤ 10.


Table 2.3 Results of the experiment of Table 2.2 compared with the predictions of the binomial 
law. The theoretical values, obtained by inserting n = 10 and p = 1/2 in Eq. (2.29), are reported 
in the third column, whereas the fourth and the fifth columns contain the cumulative frequencies 
and the probabilities calculated by inserting the probabilities of the third column in Eq. (2.27) 


Spectrum   Frequency   Probability   Cumulative   Cumulative
(heads)                              frequency    probability
0          0.00        0.001         0.00         0.001
1          0.00        0.010         0.00         0.011
2          0.05        0.044         0.05         0.055
3          0.13        0.117         0.18         0.172
4          0.12        0.205         0.30         0.377
5          0.25        0.246         0.55         0.623
6          0.24        0.205         0.79         0.828
7          0.14        0.117         0.93         0.945
8          0.06        0.044         0.99         0.989
9          0.01        0.010         1.00         0.999
10         0.00        0.001         1.00         1.000


Obviously, we assume that coins are not rigged and that the probability of obtaining
heads (or tails) is constant and equal to 1/2. Results are shown in Fig. 2.3 and in
Table 2.3. Notice that there are noticeable differences between the experimental
frequencies and the theoretical probabilities. If they were only due to the limited
size of the sample (100 tosses), we could be confident that, in the limit of an infinite
number of tosses, we would obtain the values given by the binomial distribution. In
such a situation, one says that theory and experiment are in agreement with each other
within the statistical fluctuations. If the differences were instead due to a wrong
probabilistic model (which would happen in the case of correlated coin tosses,
rigged coins or coins with memory, etc.), we should talk about systematic or
non-statistical differences between theory and experiment. Only statistics can tell us whether
the differences under discussion are statistical or systematic. Without statistical
notions, which are not intuitive at all, even a truly trivial experiment like coin tossing
cannot be correctly interpreted! If you take our word for it, we can tell you that
the differences between the binomial distribution predictions and the results of our
10 coin experiment are only due to statistical fluctuations and that the agreement
between theory and experiment is good.
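
The theoretical columns of Table 2.3 can be reproduced with a few lines of R. In this sketch the experimental frequencies are simply typed in from the table, and the data frame tab and its column names are our own choice.

```r
x <- 0:10

# Experimental frequencies, copied from Table 2.3
freq <- c(0.00, 0.00, 0.05, 0.13, 0.12, 0.25, 0.24, 0.14, 0.06, 0.01, 0.00)

prob     <- dbinom(x, 10, 0.5)   # binomial probabilities b(x; 10, 1/2)
cum_freq <- cumsum(freq)         # cumulative frequencies
cum_prob <- cumsum(prob)         # cumulative probabilities

tab <- data.frame(heads = x,
                  frequency = freq,
                  probability = round(prob, 3),
                  cum_frequency = cum_freq,
                  cum_probability = round(cum_prob, 3))
print(tab)
```

Whether the residual differences between the frequency and probability columns are only statistical cannot be decided from this table alone; this is the question addressed later with statistical tools.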


2.7 Continuous Random Variables 


When a random variable can vary continuously inside a real interval, finite or 
infinite, the definition of the density function must be done with caution and by 
successive steps. The procedure is similar to that often done in physics, when, 
for example, one moves from a set of point electric charges within a volume to a 




charge density, which gives the effective charge in a region of space only when it is 
integrated on the corresponding volume. 

Functions or distributions of continuous random variables with points of discontinuity
can sometimes be encountered. However, in practice, continuous distributions
are much more frequent and we will therefore analyse only this type of function.
If we want to calculate the probability of a certain value of X, since
$\{X = x\} \subset \{x - \varepsilon < X \le x\}$ for all $\varepsilon > 0$, from (1.14, 2.11) we get:


$$P\{X = x\} \le P\{x - \varepsilon < X \le x\} = F(x) - F(x - \varepsilon)$$

for all $\varepsilon > 0$. Hence:


$$\lim_{\varepsilon \to 0} [F(x) - F(x - \varepsilon)] = 0 ,$$


from the continuity of F(x). As a consequence, the probability of an assigned x 
value is always zero. This result is intuitively compatible with the concept of classi- 
cal and frequentist probability: the continuous spectrum includes an uncountable 
infinity of values and the probability to get in one trial exactly that x value on 
an infinite set of possible cases (a priori probability) or over an infinite number 
of occurrences (frequentist probability) must be zero. Therefore, for a continuous 
variable X, only the probabilities to fall within a finite interval are meaningful. The 
following equalities then hold: 


$$P\{a < X \le b\} = P\{a \le X < b\} = P\{a \le X \le b\} = P\{a < X < b\} ,$$


which also show that the inclusion of the extremes of the interval is irrelevant.
Consider now a continuous variable X assuming values in [a, b]. Let us divide this
interval into sub-intervals $[a = x_1, x_2], [x_2, x_3], \ldots, [x_{n-1}, x_n = b]$ of width
$\Delta x_k$, and define a discrete random variable $X'$ which takes values only at the
midpoints $x'_k$ of these intervals and let


$$p_{X'}(\Delta x_k) \equiv p_{X'}(x'_k) = P\{x_k \le X \le x_{k+1}\}$$


be the density function of $X'$, giving the probability of X to fall into the k-th interval.
Since the density $p_{X'}(\Delta x_k)$ is a function of the bin width, it depends on the
arbitrary choice of $\Delta x_k$, because the spectrum is continuous. However, it is possible
to define a function p(x) as:


$$p_{X'}(\Delta x_k) = p(x_k)\, \Delta x_k , \tag{2.30}$$




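where $x_k$ is a point internal to the k-th bin (for instance, the midpoint). According
to the definition of the Riemann integral, we can write Eq. (2.25) as:


$$\lim_{\substack{\Delta x \to 0 \\ k \to \infty}} \sum_k p_{X'}(\Delta x_k) = \lim_{\substack{\Delta x \to 0 \\ k \to \infty}} \sum_k p(x_k)\, \Delta x_k = \int p(x)\, dx = 1 . \tag{2.31}$$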


The function p(x) so defined is the probability density function of the continuous 
random variable X. 


Definition 2.7 (Probability Density Function (p.d.f.) for Continuous Variables)
The probability density of a continuous variable $X \in \mathbb{R}$ is a function $p(x) \ge 0$
satisfying, for any x, the equation:


$$F(x) = \int_{-\infty}^{x} p(t)\, dt . \tag{2.32}$$


The probability of obtaining values in the interval $[x_k, x_{k+1}]$ of width $\Delta x_k$ is given by
(see also Fig. 2.5):


$$P\{x_k \le X \le x_{k+1}\} = F(x_{k+1}) - F(x_k) = \int_{x_k}^{x_{k+1}} p(x)\, dx , \tag{2.33}$$


and the normalization property (2.25) in this case is written as:


$$\int_{-\infty}^{+\infty} p(x)\, dx = 1 . \tag{2.34}$$


Fig. 2.5 Probability density function (lower plot) with its cumulative distribution function (upper plot) for continuous random variables. The shading shows the relation between the areas of the density function and the increments of the ordinates of the cumulative function


Fig. 2.6 Assignment of a probability P(A) to an event A of the sample space S. The random variable X(a) = x transforms the experimental result a ∈ A to a codomain of the real axis $x \in R_x$, called spectrum or support. The density and cumulative functions allow the calculation of P(A) starting from the spectrum values


If the interval is small compared to the range over which the function varies, we can
approximate p(x) with a straight line within each $\Delta x_k$ (linear approximation) and
Eq. (2.33) is replaced by Eq. (2.30), where $x_k$ is the bin midpoint.
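
The relation between Eqs. (2.33) and (2.30) can be illustrated numerically. The sketch below assumes, purely as an example, the density p(x) = exp(−x) for x ≥ 0 (not introduced in the text so far), whose cumulative function is 1 − exp(−x); the names p, Fcum, xk, dx and midpoint_rule are ours.

```r
# Assumed example density and its cumulative function (for x >= 0)
p    <- function(x) ifelse(x >= 0, exp(-x), 0)
Fcum <- function(x) ifelse(x >= 0, 1 - exp(-x), 0)

# Normalization, Eq. (2.34), by numerical integration
total <- integrate(p, lower = 0, upper = Inf)$value

# Probability of the interval [xk, xk + dx]
xk <- 1.0
dx <- 0.1
exact         <- Fcum(xk + dx) - Fcum(xk)            # increment of F
by_integral   <- integrate(p, xk, xk + dx)$value     # integral of p, Eq. (2.33)
midpoint_rule <- p(xk + dx / 2) * dx                 # p(midpoint) * width, Eq. (2.30)

print(c(total = total, exact = exact,
        by_integral = by_integral, midpoint_rule = midpoint_rule))
```

Reducing dx should bring the midpoint value closer and closer to the exact one, which is the limit operation behind Eq. (2.31).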

We can therefore symbolically indicate the transition from the discrete spectrum 
to the continuous one as: 


$$\sum_k p(x_k) \to \int p(x)\, dx , \qquad p(x_k) \to p(x)\, dx ,$$


from which we see that we move from a function with discrete values to the product 
of a continuous-valued function and a differential. The differential quantity p(x) dx 
gives the probability to obtain X values within [x, x + dx]. 

In the points where F(x) is differentiable, from Eq. (2.32) one also obtains the 
important relation: 


$$p(x) = \frac{dF(x)}{dx} . \tag{2.35}$$


Density and cumulative functions have as domain the spectrum of the random 
variable and as codomain a set of real values. They can be put in correspondence 
with laws or distributions of Eq. (2.6), as shown in Fig. 2.6. 


2.8 Mean, Sum of Squares, Variance, Standard Deviation 
and Quantiles 


The description of a random phenomenon in terms of mean and variance is less 
complete than the characterization of its density function, but it has the advantage of 
being simpler and often adequate enough for many practical applications. Basically, 
the mean identifies where the centre of gravity of values of the X variable is 
localized, while the standard deviation, which is the square root of the variance, 
gives an estimate of the dispersion (spread) of values around the mean. 




To help your intuition, we first consider mean and variance of a set of data 
(although this topic will be explored further on, in statistics) and then mean and 
variance of distributions. We start by introducing a notation that will accompany us 
throughout the text. 


Definition 2.8 (True and Sample Parameters) Mean, variance and standard deviation
of a random variable X will be indicated with Greek letters; sample and
experimental parameters, coming from finite experimental samples, will be indicated
with Latin letters. The probability is an exception, because its true value will
always be indicated with p, whereas for the relative frequency we will use the symbol
f, which is used by most statistics books and avoids conflicts of notation.


We now define mean and variance of a set of experimental data: 


Definition 2.9 (Mean, Sum of Squares and Variance of a Data Sample) If
$x_i$ (i = 1, 2, ..., N) are N occurrences of a random variable, mean, sum of squares
(SS) and variance are defined as:


$$m = \frac{1}{N} \sum_{i=1}^{N} x_i , \tag{2.36}$$

$$SS = \sum_{i=1}^{N} (x_i - \mu)^2 , \tag{2.37}$$

$$s_\mu^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 , \tag{2.38}$$


where μ is the true mean assigned a priori. When the sample mean m of Eq. (2.36)
is considered, the variance is given by:


$$s_m^2 = \frac{\sum_{i=1}^{N} (x_i - m)^2}{N-1} = \frac{N}{N-1}\, \frac{\sum_{i=1}^{N} (x_i - m)^2}{N} . \tag{2.39}$$


You may have noticed that in the denominator of the variance about the sample
mean the term N − 1 appears instead of N: the reason, conceptually non-trivial, is
statistical by nature and will be explained later on, in Chap. 6. From a practical point
of view, the N/(N − 1) factor is relevant for very small samples only. When this
difference is neglected, we will write $s^2 \equiv s_m^2 \simeq s_\mu^2$.

Later on, we will use the following property of the sum of squares:


$$\begin{aligned} SS &= \sum_i (x_i - \mu)^2 = \sum_i (x_i - m + m - \mu)^2 \\ &= \sum_i (x_i - m)^2 + N (m - \mu)^2 + 2 \sum_i (x_i - m)(m - \mu) \\ &= \sum_i (x_i - m)^2 + N (m - \mu)^2 \equiv SS_r + SS_e , \end{aligned} \tag{2.40}$$


since $\sum_i (x_i - m) = \sum_i x_i - m \sum_i 1 = \sum_i x_i - N m = 0$. Therefore, the total sum of
squares is the sum of the sum of squares about the sample mean, $SS_r$ (sometimes called
residual sum of squares or RSS), and of N times the squared deviation of the sample mean
from the true one, $SS_e$ (sometimes called explained sum of squares or ESS).
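
The decomposition of Eq. (2.40) is easy to verify numerically. In the sketch below the data are simulated with rbinom and a fixed seed purely for illustration, μ = 5 plays the role of the true mean, and all variable names are ours.

```r
set.seed(1)                       # fixed seed, only to make the example repeatable
mu <- 5                           # assumed true mean (10 coins, p = 1/2)
x  <- rbinom(20, 10, 0.5)         # 20 simulated numbers of heads
N  <- length(x)
m  <- mean(x)                     # sample mean, Eq. (2.36)

ss_total       <- sum((x - mu)^2) # sum of squares about the true mean, Eq. (2.37)
ss_about_m     <- sum((x - m)^2)  # sum of squares about the sample mean
ss_mean_offset <- N * (m - mu)^2  # N times the squared deviation of m from mu

print(all.equal(ss_total, ss_about_m + ss_mean_offset))
```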

The square root $s = \sqrt{s^2}$ is the sample standard deviation or root mean square.
This operation is needed to measure dispersion in the same units as the mean.
The variance can also be considered as the average of the squared deviations $(x_i - \mu)^2$.

When the values of a discrete random variable X are collected in histograms, 
mean and variance can be calculated as: 


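$$m = \frac{\sum_{k=1}^{C} n_k x_k}{N} = \sum_{k=1}^{C} f_k x_k , \tag{2.41}$$

$$s_\mu^2 = \frac{\sum_{k=1}^{C} n_k (x_k - \mu)^2}{N} = \sum_{k=1}^{C} (x_k - \mu)^2 f_k , \tag{2.42}$$


where C is the number of histogram bins and $n_k$, $f_k$ and $x_k$ are the bin content,
relative bin frequency and bin midpoint, respectively.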

In Eqs. (2.36-2.38) the sum runs over all the raw data, for example (3 + 3 + 4 +
2 + 4 + ...), whereas in Eqs. (2.41, 2.42) the same sum is evaluated in the compact
way (2 + 2·3 + 2·4 + ...). Therefore, the two estimates give exactly the same
result for a discrete spectrum. Instead, when the spectrum is continuous, Eqs. (2.41,
2.42) can still be used by assigning the bin midpoints to $x_k$; in this case the values
of the continuous distribution are approximated in a discrete way.
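
The equivalence between the raw-data and the histogram formulae for a discrete spectrum can be checked as follows. The data vector x_raw and the value assumed for the true mean are hypothetical, chosen only for illustration; table is used here as one possible way of building the histogram.

```r
x_raw <- c(3, 3, 4, 2, 4, 5, 3, 4, 2, 5)   # hypothetical raw data
N     <- length(x_raw)
mu    <- 3.5                               # assumed true mean

# Raw-data formulae, Eqs. (2.36) and (2.38)
m_raw     <- sum(x_raw) / N
s2_mu_raw <- sum((x_raw - mu)^2) / N

# Histogram: spectrum values x_k, contents n_k and frequencies f_k
counts <- table(x_raw)
xk <- as.numeric(names(counts))
nk <- as.vector(counts)
fk <- nk / N

# Histogram formulae, Eqs. (2.41) and (2.42)
m_hist     <- sum(fk * xk)
s2_mu_hist <- sum(fk * (xk - mu)^2)

print(c(m_raw = m_raw, m_hist = m_hist,
        s2_mu_raw = s2_mu_raw, s2_mu_hist = s2_mu_hist))
```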

We now define mean, variance and standard deviation when we know a priori the
probability density of a variable X. The formulae, for a discrete random variable,
are nothing more than the generalization of (2.41, 2.42), where the measured
frequencies $f_k$ are replaced by the a priori probabilities $p_k$ of the density function.
For a continuous random variable, the formulae are derived with the same limit
operation of Eq. (2.31). We therefore obtain the following:



Definition 2.10 (Mean and Variance of a Random Variable) The mean (or
expected value) μ and the variance σ² of a random variable X are given by:


$$\mu = \sum_{k=1}^{\infty} x_k\, p_k \quad \text{(for discrete variables)} ,$$

$$\mu = \int_{-\infty}^{+\infty} x\, p(x)\, dx \quad \text{(for continuous variables)} , \tag{2.43}$$

$$\sigma^2 = \sum_{k=1}^{\infty} (x_k - \mu)^2\, p_k \quad \text{(for discrete variables)} ,$$

$$\sigma^2 = \int_{-\infty}^{+\infty} (x - \mu)^2\, p(x)\, dx \quad \text{(for continuous variables)} . \tag{2.44}$$


As in the sample case, the standard deviation is given by $\sigma = \sqrt{\sigma^2}$.
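
As an illustration of Definition 2.10, the sums and integrals can be evaluated numerically. The discrete case below uses the binomial density b(x; 10, 1/2); the continuous case assumes, only as an example, the density p(x) = exp(−x) for x ≥ 0. All variable names are ours.

```r
# Discrete case: binomial density b(x; 10, 1/2)
x  <- 0:10
pk <- dbinom(x, 10, 0.5)
mu     <- sum(x * pk)              # Eq. (2.43), discrete
sigma2 <- sum((x - mu)^2 * pk)     # Eq. (2.44), discrete
sigma  <- sqrt(sigma2)

# Continuous case: assumed density exp(-x) on [0, Inf)
mu_c     <- integrate(function(x) x * exp(-x), 0, Inf)$value
sigma2_c <- integrate(function(x) (x - mu_c)^2 * exp(-x), 0, Inf)$value

print(c(mu = mu, sigma2 = sigma2, sigma = sigma,
        mu_c = mu_c, sigma2_c = sigma2_c))
```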

The definition (2.43) is assumed to be valid only if the series or the integral of the
absolute values converges. Since $p_k, p(x) \ge 0$, this condition is equivalent
to writing:


$$\sum_{k=1}^{\infty} |x_k|\, p_k < \infty , \qquad \int_{-\infty}^{+\infty} |x|\, p(x)\, dx < \infty . \tag{2.45}$$


Equations (2.45) obviously imply the convergence of Eqs. (2.43). Here the
converse is also required, because in probability theory one always assumes that
the order of summation of terms in infinite series is irrelevant and that equalities such as
$\sum_k x_k f(x_k) \sum_i y_i g(y_i) = \sum_{ik} x_k y_i f(x_k) g(y_i)$ hold. These properties
require the absolute convergence of series, and the same holds for integrals. In
the following, the existence of mean values will always imply also the existence of
absolute mean values.

As for the variance, the definition is obviously valid if the series or the integrals 
(2.44) (which have always positive values) are not divergent. There are, however, 
some cases of random variables with an undefined variance; in this situation, 
variance cannot characterize the data dispersion and the density or cumulative 
functions must be used. 

It can be shown that the sequence of moments (see Appendix C):


$$\sum_i x_i^k\, p_i , \qquad \int x^k p(x)\, dx \qquad (k = 1, 2, \ldots, \infty)$$


allows the unique determination of the distribution function of a generic random
variable X [Cra51]. The k = 1 moment is the mean, and hence, once the other
moments are known, also the central moments about the mean can be found:


$$\Delta_k = \sum_i (x_i - \mu)^k p_i , \qquad \int (x - \mu)^k p(x)\, dx \qquad (k = 1, 2, \ldots, \infty) . \tag{2.46}$$


Notice that $\Delta_2 = \sigma^2$ and that the first moment $\Delta_1$ is always zero:


$$\Delta_1 = \sum_i (x_i - \mu)\, p_i = \mu - \mu \sum_i p_i = 0 .$$


The two most significant moments are the mean and the variance. As we
will see, they are sufficient to characterize uniquely or, at least, in an acceptable way the
statistical distributions describing almost all of the practical applications commonly
considered. In the following, we will rarely use moments beyond the second order.

The mean is perhaps the most effective parameter for evaluating the centre of 
a distribution, since the second order moment, if calculated about the mean, is 
minimal. Indeed: 


$$\frac{d\Delta_2}{d\mu} = \frac{d}{d\mu} \sum_k (x_k - \mu)^2 p_k = -2 \left[ \sum_k x_k p_k - \mu \right] , \tag{2.47}$$


and this derivative is zero only when μ is the mean (2.43), so that $\Delta_2 = \sigma^2$.
If one calculates the squares in Eqs. (2.39, 2.44), the variance can be expressed 
as the difference between the mean of squares and the square of the mean: 


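$$s_\mu^2 = \sum_{k=1}^{C} (f_k x_k^2) - \mu^2 , \tag{2.48}$$

$$s_m^2 = \frac{N}{N-1} \left[ \frac{1}{N} \sum_{i=1}^{N} x_i^2 - m^2 \right] , \tag{2.49}$$

$$s_m^2 = \frac{N}{N-1} \left[ \sum_{k=1}^{C} (f_k x_k^2) - m^2 \right] , \tag{2.50}$$

$$\sigma^2 = \sum_{k=1}^{\infty} (p_k x_k^2) - \mu^2 \to \int_{-\infty}^{+\infty} x^2 p(x)\, dx - \mu^2 . \tag{2.51}$$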


These equations are sometimes useful in practical calculations, as shown in 
Exercise 2.6. 
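
The equivalence between the variance about the sample mean and the "mean of squares minus square of the mean" form, including the N/(N − 1) factor, can be checked against R's var. The data below are simulated with a fixed seed for illustration only, and the variable names are ours.

```r
set.seed(2)                      # fixed seed, only to make the example repeatable
x <- rbinom(100, 10, 0.5)        # 100 simulated numbers of heads
N <- length(x)

# Mean of squares minus square of the mean, times N/(N - 1)
s2_shortcut <- N / (N - 1) * (mean(x^2) - mean(x)^2)

# Direct definition, Eq. (2.39); this is what var(x) computes
s2_direct <- sum((x - mean(x))^2) / (N - 1)

print(c(shortcut = s2_shortcut, direct = s2_direct, var = var(x)))
```

When the values are very large compared with their spread, the difference of two nearly equal numbers in the shortcut form can lose precision, so the direct form is probably preferable in such cases.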

The central moments of a density function, including the variance, are independent of the
position of the mean, that is, they are invariant under translations along the x axis. This
property is obvious (but important), since the intrinsic width of a function cannot
depend on its position along the axis of abscissas. This can be demonstrated in a
formal way, by defining a generic translation:


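$$x' = x + a , \quad \mu' = \mu + a , \quad dx' = dx , \quad p(x) \to p(x' - a) \equiv p'(x') ,$$


and by verifying that:


$$\Delta_n(x) = \int (x - \mu)^n p(x)\, dx = \int (x' - \mu')^n p(x' - a)\, dx' = \int (x' - \mu')^n p'(x')\, dx' \equiv \Delta_n(x') . \tag{2.52}$$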


In summary, mean, variance and moments of Eq. (2.46) for histograms of experi- 
mental samples or for random variables are given by: 


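$$m = \sum_{k=1}^{C} f_k x_k , \tag{2.53}$$

$$\mu = \sum_{k=1}^{\infty} p_k x_k \to \int_{-\infty}^{+\infty} x\, p(x)\, dx , \tag{2.54}$$

$$s_\mu^2 = \sum_{k=1}^{C} f_k (x_k - \mu)^2 , \tag{2.55}$$

$$s_m^2 = \frac{N}{N-1} \sum_{k=1}^{C} f_k (x_k - m)^2 , \tag{2.56}$$

$$\sigma^2 = \sum_{k=1}^{\infty} p_k (x_k - \mu)^2 \to \int_{-\infty}^{+\infty} (x - \mu)^2 p(x)\, dx , \tag{2.57}$$

$$D_n = \sum_{k=1}^{C} f_k (x_k - \mu)^n , \tag{2.58}$$

$$\Delta_n = \sum_{k=1}^{\infty} p_k (x_k - \mu)^n \to \int_{-\infty}^{+\infty} (x - \mu)^n p(x)\, dx , \tag{2.59}$$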


where arrows denote the transition from a discrete to a continuous variable.
In R, the calculation of mean and variance from a set x of raw data
according to Eqs. (2.36, 2.39) can be done with the functions mean(x) and
var(x). If, instead, data are in the form of histograms, the calculation can be
performed with Eqs. (2.54, 2.57) and our routines MeanHisto(x,fre) and
VarHisto(x,fre), where x is the spectrum of the considered random variable
and fre is the corresponding vector of frequencies or numbers of occurrences.

In data analysis, the so-called Q-Q (quantile-quantile) plot (see Definition 2.5) is
often used to compare two probability distributions. If $x_i$ and $y_i$ are the elements
of two vectors containing the values of two random variables X and Y sorted in
ascending order, the position of each element, divided by the length of the vector,
is close to the probability of the corresponding quantile of the parent distribution function. For example,
in a vector of 100 sorted variates, x(40) will be close to the quantile of order 0.4,
because it has 40 out of 100 values to its left, and so on. Using R, we can generate
two random vectors from the uniform distribution (defined later; see Eq. 3.79),
which returns values between 0 and 1 with constant probability, through the
commands x<-sort(runif(100)) and y<-sort(runif(100)), which
sort the values of the two vectors in ascending order. If we now represent, on the
two cartesian axes x and y, the points corresponding to the pairs of values of
these two vectors having the same position, i.e. $(x_1, y_1), (x_2, y_2), \ldots$, we obtain
the Q-Q plot for the two samples shown in Fig. 2.7. This plot has been generated
with the simple R command qqplot(x,y), followed by the commands
grid() and abline(0,1). As clearly shown by this figure, data tend to cluster
in the plane around the diagonal y = x. Indeed the Q-Q plot is used to check
whether data from two samples come from the same parent population: the better
the homogeneity hypothesis holds, the closer the quantiles of the two variables are
and, hence, the more they cluster around the diagonal. It is also possible to compare
a data sample with a theoretical population: just place these data, sorted in ascending
order, on one axis, calculate the sample quantile probabilities and place the
corresponding theoretical quantile values on the other axis. In our example with uniform
variates, this can be performed by sorting x with the command x<-sort(x),
by calculating the probabilities and the theoretical quantiles of the experimental
sample with pth<-seq(0,1,by=1/length(x)) and qth<-qunif(pth)
(note that pth=qth for uniform variates), and, finally, by plotting the result with
qqplot(qth,x). You will see that these pairs of data cluster around the diagonal.


Fig. 2.7 Q-Q plot of quantiles between two uniform variates x and y generated with R. The continuous line is the straight line y = x


In R, the qqnorm routine generates the Q-Q plot between a sample and the
normal distribution, as we will see shortly. If random variables are discrete, the
ordering no longer ensures the correct determination of the quantile, due to repeated
values. This situation is discussed in Problem 2.11.
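
The Q-Q construction described above can be sketched without drawing anything, by asking qqplot to return the plotted pairs. Two choices here are ours and not taken from the text: the fixed seed, and the use of ppoints to obtain one probability per data point (instead of the seq call quoted above, which yields one more value than the sample size).

```r
set.seed(3)                          # fixed seed, only to make the example repeatable
x <- sort(runif(100))                # first uniform sample, sorted
y <- sort(runif(100))                # second uniform sample, sorted

# Pairs (x_i, y_i) of the two-sample Q-Q plot, without opening a graphics device
qq <- qqplot(x, y, plot.it = FALSE)

# Sample against theory: one probability per sorted value
pth <- ppoints(length(x))            # (i - 1/2)/N for N > 10
qth <- qunif(pth)                    # theoretical uniform quantiles (equal to pth)

# Largest vertical distance of the sample quantiles from the diagonal
max_dev <- max(abs(x - qth))
print(max_dev)

# To see the plots: qqplot(x, y); grid(); abline(0, 1)
```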


2.9 Operators 


As we have seen, the formulae for the calculation of mean and variance assume 
different forms, depending whether we have finite or infinite datasets, discrete or 
continuous random variables, raw data or histograms. 

For this reason, it is convenient to consider mean and variance not just as numbers 
but also as operators on random variables or sets of data, in which the type of 
operation that is being carried out is given regardless of the particular representation 
used for data or variables. 

In the following, we will indicate in italics and lowercase letters:


$$\langle x \rangle ,\; m ,\; m_x ,\; s^2 ,\; s_x^2 ,\; s ,\; s_x$$


means, variances and standard deviations of variates of X obtained in a particular
trial or sampling. Notice the symbol $\langle x \rangle$ for the mean, a very common notation
among physicists and engineers.

The operators $\langle \ldots \rangle$ or E[...] refer to the mean, whereas Var[...] indicates the
variance. The random variables on which the operators act will always be indicated
in capital letters:


$$\langle X \rangle , \quad E[X] , \quad \mathrm{Var}[X] . \tag{2.60}$$


We will mainly use the notation $\langle X \rangle$, Var[X]. The standard deviation in operator
form will be indicated as $\sqrt{\mathrm{Var}[X]} \equiv \sigma[X]$.

At this point, we see from Table 2.1 that the functions that operate on random 
variables are probabilities and operators: the random variable is therefore always 
enclosed in curly or square brackets. We will use this convention throughout the 
rest of the book. 

To ease the notation, we will sometimes indicate mean, variance and standard
deviation as:


$$\langle X \rangle = \mu_x = \mu , \quad \mathrm{Var}[X] = \sigma_x^2 = \sigma^2 , \quad \sigma[X] = \sigma_x = \sigma \tag{2.61}$$


(note the subscripts written in lowercase letters). The notation $\mu$ and $\sigma^2$ (or $\mu_x$ and
$\sigma_x^2$ if you need to specify the variable) is intended as the numerical result
of the corresponding statistical operator. In any case, the notation with Greek letters
(true values) uniquely defines the type of operation performed, avoiding possible
confusion.

It is easy to verify, from Eqs. (2.53—2.57), that mean and variance operators have 
the following properties: 


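$$\mathrm{Var}[X] = \langle (X - \langle X \rangle)^2 \rangle , \tag{2.62}$$

$$\langle \alpha X \rangle = \alpha \langle X \rangle , \tag{2.63}$$

$$\mathrm{Var}[\alpha X] = \alpha^2\, \mathrm{Var}[X] , \tag{2.64}$$

$$\langle X + \alpha \rangle = \langle X \rangle + \alpha , \tag{2.65}$$

$$\mathrm{Var}[X + \alpha] = \langle (X + \alpha - \langle X \rangle - \alpha)^2 \rangle = \mathrm{Var}[X] , \tag{2.66}$$


where α is a constant. The last two equations show, of course, that the average of
a constant coincides with the constant itself and that there is no dispersion for a
constant. Using Eqs. (2.62, 2.63), we can rewrite in operator notation Eq. (2.51),
which expresses the variance as the mean of squares minus the square of the mean:


$$\begin{aligned} \mathrm{Var}[X] &= \langle (X - \mu)^2 \rangle \\ &= \langle (X - \langle X \rangle)^2 \rangle \\ &= \langle X^2 \rangle - 2 \langle X \rangle \langle X \rangle + \langle X \rangle^2 \\ &= \langle X^2 \rangle - \langle X \rangle^2 . \end{aligned} \tag{2.67}$$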
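
The operator properties above can be verified numerically on a simple discrete variable. The sketch below uses the binomial density b(x; 10, 1/2) as an example; expect, variance, id and same are hypothetical helper functions written for this illustration, and the constant alpha = 3 is arbitrary.

```r
x     <- 0:10
p     <- dbinom(x, 10, 0.5)   # example density on the spectrum x
alpha <- 3                    # arbitrary constant

expect <- function(g) sum(g(x) * p)          # mean operator applied to g(X)
variance <- function(g) {                    # variance operator applied to g(X)
  m <- expect(g)
  expect(function(t) (g(t) - m)^2)
}
id   <- function(t) t                        # g(X) = X
same <- function(a, b) isTRUE(all.equal(a, b))

checks <- c(
  mean_scale = same(expect(function(t) alpha * t), alpha * expect(id)),
  var_scale  = same(variance(function(t) alpha * t), alpha^2 * variance(id)),
  mean_shift = same(expect(function(t) t + alpha), expect(id) + alpha),
  var_shift  = same(variance(function(t) t + alpha), variance(id)),
  mean_sq    = same(variance(id), expect(function(t) t^2) - expect(id)^2)
)
print(checks)
```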


The mean operator allows the definition of the true mean value of any function of 
random variable (2.8): 


Definition 2.11 (Expected Value) If f(X) is a function of a random variable X
with p.d.f. p(x), the expected value of f(X) is the quantity:


$$\langle f(X) \rangle \equiv E[f(X)] = \sum_k f(x_k)\, p(x_k) \to \int f(x)\, p(x)\, dx , \tag{2.68}$$


where the arrow indicates the transition from a discrete to a continuous variable.


Functions of random variables will be treated in detail later, in Chap. 5. According
to this definition, the true mean μ can be considered as the expected value of X. The
expected value is also known as the expectation, mathematical expectation, mean,
average or first moment.



As in the case of Definition 2.10, the existence of the sum $\sum_k |f(x_k)|\, p(x_k)$
or of the integral $\int |f(x)|\, p(x)\, dx$ is implied in the definition of the expected
value of f(X). Therefore, one always assumes that a variable or function of a
variable, for which one defines the expected value (2.68), is absolutely summable or
integrable on the corresponding p.d.f. This implies, for a continuous variable, that
the probability density tends to zero for $x \to \pm\infty$ at least like $1/|x|^{\alpha}$ with α > 2.
All densities which we will consider later have this property.


Exercise 2.5 

Consider the space S where the event $A_1$ can occur as $x_1$ with probability
$p_1$ and the event $A_2$, independent of the previous one, can occur as $x_2$ with
probability $p_2$. Find the mean of the sum $(X_1 + X_2)$ where $X_1$ and $X_2$ are
dummy variables defined as $X_1(A_1) = x_1$, $X_2(A_2) = x_2$ and $X_1(\bar{A}_1) = 0$,
$X_2(\bar{A}_2) = 0$.


Answer The spectrum of the sum is given by the four values $(0 + 0)$, $(0 + x_2)$,
$(x_1 + 0)$, $(x_1 + x_2)$, having, from Theorem (1.21), probabilities $(1 - p_1)(1 - p_2)$,
$(1 - p_1)\, p_2$, $p_1 (1 - p_2)$ and $p_1 p_2$, respectively. From Eq. (2.54) one has:


$$\begin{aligned} \mu &= (1 - p_1)(1 - p_2)(0 + 0) + (1 - p_1)\, p_2\, (0 + x_2) \\ &\quad + p_1 (1 - p_2)(x_1 + 0) + p_1 p_2 (x_1 + x_2) \\ &= p_1 x_1 + p_2 x_2 . \end{aligned}$$


Since $\langle X_1 \rangle = (1 - p_1) \cdot 0 + p_1 x_1 = p_1 x_1$, and the same holds for $X_2$, one
can write:


$$\langle X_1 + X_2 \rangle = \langle X_1 \rangle + \langle X_2 \rangle ,$$


that is, the mean of a sum is equal to the sum of the means. 
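
The result of Exercise 2.5 can be checked by enumerating the four outcomes. The numerical values chosen below for the probabilities and for the spectrum are hypothetical, and the variable names are ours.

```r
# Hypothetical values for the two independent events
p1 <- 0.3; x1 <- 2
p2 <- 0.6; x2 <- 5

# The four possible pairs (X1, X2)
outcomes <- expand.grid(X1 = c(0, x1), X2 = c(0, x2))

# Probability of each pair: product of the single probabilities (independence)
prob <- with(outcomes,
             ifelse(X1 == 0, 1 - p1, p1) * ifelse(X2 == 0, 1 - p2, p2))

mean_sum      <- sum(prob * (outcomes$X1 + outcomes$X2))   # mean of the sum
sum_of_means  <- p1 * x1 + p2 * x2                         # sum of the means

print(c(mean_sum = mean_sum, sum_of_means = sum_of_means))
```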


Exercise 2.6 
Find mean and standard deviation of the data shown in Table 2.2, and compare 
them with the values given by the binomial distribution. 


Answer By applying Eq. (2.53) to the data of the first and third columns of 
the table, we obtain: 


$$\langle x \rangle = m = (2 \cdot 0.05 + 3 \cdot 0.13 + \cdots + 9 \cdot 0.01) = 5.21 ,$$


for a total of 521 heads over 1000 tosses. The mean of squares is given by:


$$\langle x^2 \rangle = (4 \cdot 0.05 + 9 \cdot 0.13 + \cdots + 81 \cdot 0.01) = 29.69 .$$


From Eq. (2.50) we then have


$$s_m^2 = \frac{100}{99} \left( \langle x^2 \rangle - \langle x \rangle^2 \right) = \frac{100}{99} \left( 29.69 - 5.21^2 \right) = 2.57 ,$$

$$s_m = \sqrt{2.57} = 1.60 .$$


By applying the same procedure to the true probabilities of the fourth column,
we get:


$$\langle X \rangle = 5.00 , \quad \langle X^2 \rangle = 27.5 ,$$

$$\mathrm{Var}[X] = \sigma^2 = \langle X^2 \rangle - \langle X \rangle^2 = 2.5 ,$$

$$\sigma = \sqrt{2.5} = 1.58 .$$


Notice the absence of the 100/99 factor in the calculation, since here the
variance is evaluated with respect to the true mean. Again, the difference
between the experimental values m = 5.21, s = 1.60 and the theoretical
ones μ = 5.00, σ = 1.58 will be explained in Chap. 6.


2.10 Simple Random Sample 


Consider an experiment involving a random variable X and N independent observations.
The result of this operation is an N-tuple of independent variables
$(X_1, X_2, \ldots, X_N)$, for which condition (2.9) applies. We then arrive at the following
definition.


Definition 2.12 (Simple Random Sample) The set of N independent variables
$(X_1, X_2, \ldots, X_N)$ coming from the same probability density function p(x) is
called a simple random sample of size N, extracted from the parent population of
density p(x). The variables are called independent and identically distributed,
sometimes abbreviated as iid.


The correct, but somewhat long, definition of “population of density p(x)” is 
sometimes abbreviated to “population p(x)”. The concept of population, introduced 
on an intuitive basis in Sect. 1.2, here assumes a precise meaning. For a discrete 
random variable, the probability to obtain a certain set of random variates is, 


from Eq. (2.9):


$$P\{X_1 = x_1, X_2 = x_2, \ldots, X_N = x_N\} = \prod_{i=1}^{N} P\{X_i = x_i\} = \prod_{i=1}^{N} p(x_i) . \tag{2.69}$$


To better understand the concept of random sample, we must refer to a situation 
where a random variable X is sampled repeatedly, according to the following 
scheme: 


$$\begin{array}{lll} \text{first sample} & x'_1, x'_2, \ldots, x'_N & \to m_1, s_1^2 \\ \text{second sample} & x''_1, x''_2, \ldots, x''_N & \to m_2, s_2^2 \\ \text{third sample} & x'''_1, x'''_2, \ldots, x'''_N & \to m_3, s_3^2 \\ \ldots & \ldots & \ldots \\ \text{all samples} & X_1, X_2, \ldots, X_N & \to M, S^2 \end{array} \tag{2.70}$$


$X_1$ is the random variable "occurrence of the first trial" or "first element", whereas
$x''_1$ is the random variate resulting from the first trial in the second sample. The
values $m_i$ and $s_i^2$ are the mean and variance of the i-th sample:


$$m_i = \sum_k \frac{x_k}{N} , \qquad s_i^2 = \sum_k \frac{(x_k - m_i)^2}{N - 1} .$$


These values, which estimate the corresponding true quantities μ and σ² from a
sample of finite size, are called estimates of mean and variance. If we repeat the
experiment or sampling, we will get different means and variances: the sample
values m and s² therefore have to be considered as realizations or variates of the
random variables M and S², indicated in the last line of Eq. (2.70), which are
functions, in the sense of (2.8), of the variables $X_i$:


$$M = \sum_i \frac{X_i}{N} , \qquad S^2 = \sum_i \frac{(X_i - M)^2}{N - 1} . \tag{2.71}$$


The variables M and S² are sample functions. In general, any of these quantities is
called a statistic.

Definition 2.13 (Statistic) For a given sample $(X_1, X_2, \ldots, X_N)$ from a density
p(x), any function


$$T = t(X_1, X_2, \ldots, X_N) \tag{2.72}$$


which does not contain any unknown parameter, is a random variable called a statistic
(singular).


Sample mean and variance are two examples of statistics. As we will see in the next
section, they are also estimators of the mean and of the variance.
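
The idea that the sample mean and variance are themselves random variables can be made concrete by repeating the sampling several times. The sketch below draws a few samples from the binomial population b(x; 10, 1/2); the sample size, the number of samples, the seed and the variable names are arbitrary choices of ours.

```r
set.seed(4)                   # fixed seed, only to make the example repeatable
N         <- 10               # size of each sample
n_samples <- 5                # number of repeated samples

# Each row is one sample (x_1, ..., x_N) from the population b(x; 10, 1/2)
samples <- matrix(rbinom(N * n_samples, 10, 0.5), nrow = n_samples)

m  <- apply(samples, 1, mean) # one sample mean per row
s2 <- apply(samples, 1, var)  # one sample variance per row (N - 1 denominator)

print(data.frame(sample = 1:n_samples, m = m, s2 = s2))
```

Each row of the printed table is one realization of the statistics M and S²; a different seed would give different variates around the true values 5 and 2.5.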


2.11 Convergence Criteria 


At this point, it is necessary to specify precisely the meaning of the limits and of the
convergence criteria used in the study of random variables.

As we have mentioned before, the frequentist limit of Eq. (1.3) is applied to the 
realizations of the random variable: it does not have a precise mathematical meaning 
and indicates that trials must be repeated an infinite or finite number of times, until 
the population is used up. This is the meaning to be attributed to a limit whenever 
it is applied to a sequence or sum of values of a variable (lowercase notation). We 
called this operation frequentist limit. However, expressions such as: 


\[
\lim_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} x_i = \mu , \qquad \lim_{N\to\infty} s^2 = \sigma^2 \qquad \text{(wrong!)} ,
\]


should be avoided, because precise mathematical quantities are present on the right 
side, whereas we have an undefined limit on the left. We point out, however, that 
this limit is justified in the frequentist interpretation (1.3) and that it reproduces 
qualitatively what actually happens in many observations. For example, if we 
consider the toss of a die where $\{X = 1, 2, 3, 4, 5, 6\}$ and we assign to each
face a probability equal to 1/6, according to Eq. (2.43) $\mu = (1+2+3+4+5+6)/6 = 3.5$.
Experience shows that, when rolling a die and progressively averaging
the scores according to Eq. (2.36), the result tends to 3.5 as
the number of rolls increases. For instance, after having arbitrarily assigned
a probability of 1/6 to each face, we simulated a die roll using a computer and
obtained, for $N$ = 100, 1000, 100,000, 1,000,000, the sample means $m$ =
3.46, 3.505, 3.50327, 3.500184, respectively.
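
A simulation of this kind can be sketched in a few lines of R. The seed and the variable names below are arbitrary choices made for this illustration, not those used for the figures quoted above, so the sample means will differ slightly from the ones in the text.

```r
# Sketch: running sample mean of simulated die rolls (names are illustrative).
set.seed(1)                                  # arbitrary seed, for reproducibility only
N <- c(100, 1000, 100000, 1000000)           # sample sizes quoted in the text
rolls <- sample(1:6, size = max(N), replace = TRUE)  # each face has probability 1/6
m <- cumsum(rolls)[N] / N                    # sample mean after the first N rolls
print(data.frame(N = N, m = m))
```

As N grows, the values in the column m should settle around 3.5, which is the qualitative behaviour described by the frequentist limit.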

For a mathematically rigorous study of the random phenomena, it is however 
necessary to establish, as the sample size increases, the type of convergence that 
might occur in sequences of random variables, and the extent of the deviations from 
the limit value. 

A first rigorous definition states that a sequence of variables $X_N$ converges to a
variable $X$ if, given any $\epsilon > 0$ however small, the probability that $X_N$ differs from
$X$ by a quantity $\ge \epsilon$ tends to zero for $N \to \infty$:

\[
\lim_{N\to\infty} P\{|X_N(a) - X(a)| \ge \epsilon\} = 0 , \quad \text{or} \quad \lim_{N\to\infty} P\{|X_N(a) - X(a)| < \epsilon\} = 1 . \tag{2.73}
\]




This limit, which fulfils the usual properties of the limits of real number sequences, 
is called limit in probability or weak convergence. 

At this point, we draw your attention to a subtle distinction: the limit of Eq. (2.73)
does not ensure that all the values $|X_N(a) - X(a)|$, for each element $a$ of the sample
space, will be less than $\epsilon$ above a certain $N$, but only that the set of values exceeding
$\epsilon$ has a vanishing probability.

If one requires that for most of the sequences the condition $|X_N(a) - X(a)| < \epsilon$
holds for any $a$, the almost sure or strong convergence on the set of
elements $a$ of the probability space $(S, \mathcal{F}, P)$ must be introduced:

\[
P\left\{ \lim_{N\to\infty} |X_N(a) - X(a)| \le \epsilon \right\} = 1 . \tag{2.74}
\]

In this case, we are sure that there is a set of elements $a$, converging to the sample
space $S$ for $N \to \infty$, such that the sequence of real numbers $X_N(a)$ tends to
the standard mathematically defined limit. The convergence in probability and the
almost sure convergence are graphically represented in Fig. 2.8.

When $X_N$ is the sample mean estimator and $X(a) = \mu$, Eqs. (2.73) and (2.74)
are called weak and strong law of large numbers, respectively.

The last type of convergence we consider is the convergence in law or distribution.
A sequence of random variables $X_N$ converges in distribution to a variable $X$
if:

\[
\lim_{N\to\infty} F_{X_N}(x) = F_X(x) , \tag{2.75}
\]


Fig. 2.8 (a) Convergence in probability: the set of values $|X_N(a) - X(a)|$ higher than an assigned
value $\epsilon$ tends to have vanishing probability for $N \to \infty$. However, this does not prevent
some points from lying outside the limit (denoted by arrows). (b) Almost sure or strong convergence: most
sequences satisfy the inequality $|X_N(a) - X(a)| < \epsilon$, except for a set of elements $a \in S$ having
null probability




for any point $x$ where $F_X(x)$ is continuous; here $F_{X_N}$ and $F_X$ are the corresponding
distribution or cumulative functions. This limit is widely used in statistics when
studying the type of distribution followed by statistical estimators. It can be
demonstrated (as, indeed, it is quite intuitive) that the almost sure convergence
implies the convergence both in probability and in distribution and that convergence
in probability implies the one in law, while the converse is not true. Furthermore,
a fundamental theorem of Kolmogorov, whose proof can be found in [Fel47],
guarantees the almost sure convergence of sequences of independent random
variables if the condition:

\[
\sum_{N} \frac{\mathrm{Var}[X_N]}{N^2} < +\infty \tag{2.76}
\]


holds. All statistical estimators considered in this book satisfy this property, so 
that, when dealing with statistics, we will not have convergence problems: all the 
considered variables will converge almost surely (and, therefore, also in probability) 
to the requested limit values. The convergence criteria are sometimes important in 
the theory of stochastic functions [PUP02], which will not be considered here. 

In summary, we have described four different limits: the frequentist limit (acting 
on the “lowercase” variables), the one in probability, and the almost sure one (both 
acting on “uppercase” variables) and the one in law, which involves distribution 
functions. 

The random variables considered as limits in Eqs. (2.73, 2.74) can also simply
be constants, as often happens for the limits of statistical estimators. In fact, a
statistic $T_N(X)$, function of a random sample of size $N$ according to Eq. (2.72), and
converging in probability to a constant:

\[
\lim_{N\to\infty} P\{|T_N(X) - \mu| \ge \epsilon\} = 0 , \tag{2.77}
\]

is defined as a consistent estimator of $\mu$.

The theory of statistical estimators, such as those defined in Eq. (2.71), will be 
described in detail later, in Chaps. 6 and 10. However, here we want to explain the 
meaning of mean and variance of an estimator. Since from (2.67) it results that the 
variance can be written in terms of mean operators, it will be enough to discuss only 
the mean of an estimator. For example, let us examine the meaning of the mean of 
the variance: 


\[
\langle T_N \rangle = \left\langle \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2 \right\rangle .
\]


What is in the $\langle\,\rangle$ brackets is a random variable built from $N$ observations $x_i$ of
$X$, combined to calculate their variance. The $\langle\,\rangle$ brackets indicate that one has
to repeat the procedure an infinite number of times and to take the average of the
infinite variances thus obtained. Therefore, in the frequentist view, at first a sample
of the variable $\sum_i (X_i - \mu)^2/N$ is obtained from a series of $N'$ samples of $N$
observations each of the same variable $X$, and then this last sample is averaged for $N' \to \infty$.
The procedure is shown in Fig. 2.9. From the properties (2.67) of the mean
operator, one obtains:

\[
\left\langle \sum_{i=1}^{N} (X_i - \mu)^2 \right\rangle = \sum_{i=1}^{N} \left\langle (X_i - \mu)^2 \right\rangle = N\sigma^2 , \tag{2.78}
\]

that is:

\[
\left\langle \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2 \right\rangle = \sigma^2 .
\]

This last equation can also be written as:

\[
\langle T_N(X) \rangle = \sigma^2 . \tag{2.79}
\]

Fig. 2.9 The mean of a variance: $N'$ samples of $N$ observations give a sample of $N'$ variances, whose mean is taken for $N' \to \infty$


The consistent estimators which satisfy this property are unbiased. Basically, for
this class of estimators, the true mean of a population consisting of $T_N$ elements
obtained from samples of size $N$ coincides with the limit to infinity (true value) of
the estimator.

As it will be later shown in Sects. 6.10 and 6.11, the sample mean (2.53) satisfies 
the properties (2.77) and (2.79), while the variance about the sample mean (2.56), 
without the factor N/(N — 1), does not satisfy Eq. (2.79). 
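
The procedure of Fig. 2.9 can be imitated numerically. The sketch below assumes, purely for illustration, a Gaussian population with $\mu = 0$ and $\sigma^2 = 4$ and a large but finite number of samples; the names `Nprime`, `mean_biased` and so on are ours and do not appear in the book.

```r
# Sketch: "mean of a variance" with N' samples of N observations each.
set.seed(2)                    # arbitrary seed
N      <- 5                    # observations per sample
Nprime <- 100000               # number of samples (stands in for N' -> infinity)
sigma2 <- 4                    # assumed true variance; the true mean is 0
samples <- matrix(rnorm(N * Nprime, mean = 0, sd = sqrt(sigma2)), nrow = Nprime)

# Variance about the TRUE mean, as in Eq. (2.78): averages to sigma^2
mean_known_mu <- mean(rowSums(samples^2) / N)

# Variance about the SAMPLE mean, with and without the factor N/(N - 1)
dev2 <- rowSums((samples - rowMeans(samples))^2)
mean_biased   <- mean(dev2 / N)         # expected to be near sigma2 * (N - 1) / N
mean_unbiased <- mean(dev2 / (N - 1))   # expected to be near sigma2

print(c(known_mu = mean_known_mu, biased = mean_biased, unbiased = mean_unbiased))
```

With these settings the first and third averages should be close to 4, while the second should be close to 3.2, which illustrates why the factor N/(N - 1) is needed when the mean is estimated from the same sample.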




Generally, the study of the asymptotic properties of random variables and their
estimators is simpler if we apply the mean operator to a set of estimators $T_N(X)$,
instead of studying directly the estimator limit (2.73) for $N \to \infty$.


2.12 Problems 


2.1 Calculate the probability of obtaining the face 6 twice when tossing three dice.


2.2 If one assigns a probability equal to 1/2 to the event head, calculate the
probability of obtaining 3 heads in 10 tosses.


2.3 A manufacturer knows that the percentage of defective pieces offered for sale 
is 10%. In a contract, he agrees to pay a penalty if, in one box of ten pieces, more 
than two pieces are defective. Find the probability to pay the penalty. 


2.4 One player wagers 60 euros on a roulette by betting 50 euros on red (X strategy)
and 10 euros on the black number 22 (Y strategy). Keeping in mind that the numbers
range from 0 to 36, that the dealer wins if the 0 hits, that the payout is equal to
the bet if the red hits and is 36 times the stake if 22 is the winning number, calculate
the average capital value after a bet. What does this result mean?


2.5 A variable X can take the values {X = 2, 4, 6, 8, 10} with equal probabilities. 
Find mean and standard deviation of the variables X and Y = 2X + 1. 


2.6 Three marbles are extracted, without replacement, from an urn containing three 
red and seven black marbles. Determine the discrete p.d.f., mean and standard 
deviation of the number R of black marbles after three draws. 


2.7 Find probability density p(x), cumulative function F(x), mean and standard 
deviation of a continuous variable X, defined in [0,1], and having a linearly 
increasing density such that p(0) = 0. 


2.8 Find the 25-th percentile of the distribution of the previous problem.


2.9 To cover the round-trip distance between two points A and B, 100 km apart,
a car travels at a speed of 25 km/h one way and 50 km/h
on the way back. Calculate the average speed of the trip.




2.10 Find the total number of heads over a total of 1000 tosses from the data of 
Table 2.2. 


2.11 Generate the Q-Q plot between two vectors of size 100 extracted from the 
binomial distribution b(x; n = 20, p = 0.3). Then, produce the Q-Q plot of these 
generated random numbers versus the expected distribution. 


Chapter 3
Basic Probability Theory


The question is not so much whether God plays dice, but how 
God plays dice 


Ian Stewart, “DOES GOD PLAY DICE?”. 


3.1 Introduction 


In this chapter we start by analysing the properties of the binomial distribution,
and then we will gradually derive, using probability theory, all other fundamental
statistical distributions. We will not avoid important mathematical steps, since we
believe that this helps to have a general and consistent vision of the described
topics. This will allow you, while analysing any scientific problem, to immediately
understand its statistical and probabilistic aspects and to find the solution using the
most appropriate statistical distributions.

We will end the chapter with some hints about the use of probability theory in 
hypothesis testing. This topic, which will be fully developed later on, in statistics, 
will allow you to appreciate better what you have learned and to fully understand 
many natural phenomena. 


3.2 Properties of the Binomial Distribution 


The binomial density, introduced in Sect. 2.6, is the p.d.f. of a random variable X 
which represents the number x of successes obtained in n independent trials with 
constant success probability. The properties of this density will allow us to develop, 
by successive stages, the basic scheme of density functions for one-dimensional 
random variables. 


© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
A. Rotondi et al., Probability, Statistics and Simulation, La Matematica per il 3+2
139, https://doi.org/10.1007/978-3-031-09429-3_3




The binomial density is normalized, since from Newton's binomial formula
and from Eq. (2.29), we have:

\[
\sum_{x=0}^{n} \binom{n}{x} p^x (1-p)^{n-x} = [p + (1-p)]^n = 1 . \tag{3.1}
\]
x=0 


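The mean of the binomial distribution is given by Eq. (2.43), where the sum must
be extended to all values $0 \le x \le n$, and the probability $p_x$ is given by Eq. (2.29):

\[
\mu = \sum_{x=0}^{n} x\, b(x; n, p) = \sum_{x=0}^{n} x\, \frac{n!}{x!\,(n-x)!}\, p^x (1-p)^{n-x}
\]
\[
\phantom{\mu} = \sum_{x=1}^{n} \frac{n\,(n-1)!}{(x-1)!\,(n-x)!}\; p\, p^{x-1} (1-p)^{n-x} , \tag{3.2}
\]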


where in the last row the change of the lower limit of the sum over $x$ has to be noted since, when
$x = 0$, the term of the sum is zero.
If we set in Eq. (3.2):

\[
x' = x - 1 , \quad n' = n - 1 , \quad n' - x' = n - x , \tag{3.3}
\]

we obtain, using Eq. (3.1):

\[
\mu = np \sum_{x'=0}^{n'} \frac{n'!}{x'!\,(n'-x')!}\, p^{x'} (1-p)^{n'-x'} = np , \tag{3.4}
\]


a result in agreement with intuition, which considers mean as an expected value, 
i.e. the product between the number of attempts (or trials) and the probability of 
each attempt. On the contrary, the value of the variance is not at all intuitive and, 
as we will see later, extremely important. It can be easily obtained by calculating 
the average of the squares, with the same procedure as for (3.2), and again using 
Eq. (3.3): 


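\[
\langle X^2 \rangle = \sum_{x=1}^{n} x^2 \binom{n}{x} p^x (1-p)^{n-x} = np \sum_{x'=0}^{n'} (x'+1) \binom{n'}{x'} p^{x'} (1-p)^{n'-x'}
\]
\[
\phantom{\langle X^2 \rangle} = np\,(n'p + 1) = np\,[(n-1)p + 1] .
\]

Then, from Eqs. (2.67) and (3.4), one can write:

\[
\mathrm{Var}[X] = np\,[(n-1)p + 1] - n^2p^2 = np(1-p) . \tag{3.5}
\]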




In conclusion, we have obtained the fundamental equations:

\[
\mu = np , \quad \sigma^2 = np(1-p) , \quad \sigma = \sqrt{np(1-p)} , \tag{3.6}
\]

which are among the milestones of probability calculus. The value of the mean is
intuitive, while those of variance and standard deviation are not, and it helps if you
memorize them right now.
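
Equations (3.1) and (3.6) can be checked numerically with the R function `dbinom`, by summing directly over the spectrum. The values n = 10 and p = 0.5 below are just an example (they correspond to the ten-coin experiment), and the variable names are ours.

```r
# Sketch: numerical check of normalization, mean and variance of the binomial.
n <- 10
p <- 0.5
x <- 0:n                               # spectrum of the binomial variable
b <- dbinom(x, size = n, prob = p)     # binomial probabilities b(x; n, p)

total  <- sum(b)                       # normalization, Eq. (3.1)
mu     <- sum(x * b)                   # mean from the definition
sigma2 <- sum((x - mu)^2 * b)          # variance from the definition

print(c(total = total, mean = mu, np = n * p,
        variance = sigma2, npq = n * p * (1 - p)))
```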


Exercise 3.1 
Find the true mean, variance and standard deviation for the 10 coin experiment 
of Table 2.2. 


Answer Since the experiment is described by the binomial distribution with 
n = 10 and p = 0.5, from Eqs. (3.6) one has: 


\[
\mu = 10 \cdot 0.5 = 5 ,
\]
\[
\sigma^2 = 10 \cdot 0.5\,(1 - 0.5) = 2.5 ,
\]
\[
\sigma = \sqrt{2.5} = 1.58 .
\]


The results are identical to those obtained in Exercise 2.6, where the basic 
formulae (2.54, 2.51) have been applied to the binomial probabilities reported 
in the fourth column of Table 2.2. 


Exercise 3.2 

The probability of hitting a target is equal to 80%. Find the p.d.f. of the
number $n$ of trials needed to be successful. Find also the mean and standard
deviation of $n$.


Answer We have to find the probability distribution of the number n of 
independent trials needed to get one success, when the probability of a single 
success is p. 

In this case the random variable to be considered no longer consists
of the number $X$ of successes in $n$ fixed attempts (leading to the binomial
distribution) but is the number of attempts $n$, when the number of successes is
fixed and equal to $x = 1$. For the $n$-th attempt to be successful, we must have
at first $x = 0$ successes in $(n - 1)$ trials; the probability of this event is given
by the binomial density (2.29):

\[
b(x = 0; n-1, p) = \frac{(n-1)!}{0!\,(n-1)!}\, p^0 (1-p)^{n-1} = (1-p)^{n-1} .
\]

Therefore, the probability to have a success in the $n$-th trial will be given by
the compound probability:

\[
g(n) = p\,(1-p)^{n-1} , \tag{3.7}
\]


which is named geometric density. It can therefore be seen as a simple 
application of the law of compound probabilities (1.24), where the two 
probabilities refer to (n — 1) consecutive failures followed by a success at 
the n-th attempt. 

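The density is normalized, because from the theory of series we have:

\[
\sum_{n=0}^{\infty} p^n = \frac{1}{1-p} , \qquad \sum_{n=1}^{\infty} p^n = \frac{p}{1-p} , \qquad 0 \le p < 1 , \tag{3.8}
\]

so one gets:

\[
\sum_{k=1}^{\infty} p\,(1-p)^{k-1} = \frac{p}{1-p} \sum_{k=1}^{\infty} (1-p)^k = \frac{p}{1-p}\, \frac{1-p}{p} = 1 , \qquad 0 < p < 1 .
\]

Mean and variance can be evaluated from Eqs. (2.43, 2.67) and from the
properties of the geometric series. By differentiating Eq. (3.8) twice, one
easily obtains:

\[
\sum_{k=1}^{\infty} k\, p^{k-1} = \frac{1}{(1-p)^2} , \qquad \sum_{k=1}^{\infty} k^2 p^{k-1} = \frac{1+p}{(1-p)^3} . \tag{3.9}
\]

From these equations, with $p$ replaced by $1-p$, one has:

\[
\langle n \rangle = \sum_{k=1}^{\infty} k\, p\,(1-p)^{k-1} = p \sum_{k=1}^{\infty} k\,(1-p)^{k-1} = \frac{1}{p} , \tag{3.10}
\]
\[
\langle n^2 \rangle = \sum_{k=1}^{\infty} k^2 p\,(1-p)^{k-1} = \frac{2-p}{p^2} . \tag{3.11}
\]

The variance is given by:

\[
\sigma_n^2 = \langle n^2 \rangle - \langle n \rangle^2 = \frac{2-p}{p^2} - \frac{1}{p^2} = \frac{1-p}{p^2} . \tag{3.12}
\]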


In this specific exercise, the probability, mean and standard deviation are then:

\[
g(n) = 0.8 \cdot (0.2)^{n-1} ,
\]
\[
\mu_n = 1/p = 1.25 ,
\]
\[
\sigma_n = \sqrt{1-p}\,/\,p \simeq 0.56 .
\]

A mean value $\mu_n = 1.25$ means that, in 12-13 runs, one will have, on average,
10 runs in which the first attempt is successful.
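
The numbers of this exercise can be cross-checked in R. The sketch below truncates the infinite sums at n = 200 (an arbitrary cut-off, harmless here because the terms decay as $0.2^{n-1}$) and compares the density (3.7) with the built-in `dgeom`, which is parametrized by the number of failures before the first success, that is n - 1.

```r
# Sketch: geometric density for p = 0.8, with mean and standard deviation.
p <- 0.8
n <- 1:200                          # truncated spectrum (assumption: tail negligible)
g   <- p * (1 - p)^(n - 1)          # Eq. (3.7)
g_R <- dgeom(n - 1, prob = p)       # same density from R (argument = failures)

mu_n    <- sum(n * g)                       # should reproduce 1/p
sigma_n <- sqrt(sum(n^2 * g) - mu_n^2)      # should reproduce sqrt(1 - p)/p

print(c(mean = mu_n, sd = sigma_n, variance = sigma_n^2))
```

The printed variance is about 0.31 and the standard deviation about 0.56, in agreement with Eq. (3.12).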


3.3 Poisson Distribution 


The calculation of the binomial density is not easy, for large $n$, due to the factorials
appearing in Eq. (2.29). If, in addition to the condition $n \gg 1$, the probability of
the event is small ($p \ll 1$), then one will have few successes ($x \ll n$) and the
approximation:

\[
\frac{n!}{(n-x)!} = n(n-1)(n-2)\cdots(n-x+1) \simeq n^x
\]

holds. By writing $y = (1-p)^{n-x}$ and using logarithms, one has:

\[
\ln y = (n-x)\ln(1-p) \simeq -p\,(n-x) \simeq -np ,
\]
\[
y = (1-p)^{n-x} = e^{\ln y} \simeq e^{-np} .
\]

After these transformations, the binomial density, for $n \gg 1$ and $p \ll 1$, assumes
the form:

\[
\lim_{\substack{n\to\infty \\ p\to 0}} b(x; n, p) = \lim_{\substack{n\to\infty \\ p\to 0}} \frac{n!}{x!\,(n-x)!}\, p^x (1-p)^{n-x}
= \frac{n^x}{x!}\, p^x e^{-np} = \frac{(np)^x}{x!}\, e^{-np} , \tag{3.13}
\]

from which, on the basis of Eq. (3.6) and of the definition $\mu = np$, the Poisson density
is obtained:

\[
p(x; \mu) = \frac{\mu^x}{x!}\, e^{-\mu} . \tag{3.14}
\]




Fig. 3.1 Poissonian (full line) and binomial (dashed line) densities for
$n = 10$, $p = 0.1$, $\mu = np = 1$. For a better comparison, the discrete
values are joined with straight lines


It represents the probability to obtain a value $x$ when the mean value is $\mu$. The
Poissonian is practically considered an acceptable approximation of the binomial
already starting from $\mu > 10$ and $p < 0.1$, as shown in Fig. 3.1. The R code which
reproduces this figure is:


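BinPoisTest <- function(n = 10, p = 0.1) {
   x <- 0:n                              # spectrum 0, 1, ..., n
   y <- dbinom(x, size = n, prob = p)    # binomial density b(x; n, p)
   z <- dpois(x, lambda = n * p)         # Poisson density with mu = n*p
   plot(x, y, type = 'l', lwd = 3, xlab = 'x', ylab = '',
        lty = 'dashed', font.lab = 2, cex.lab = 1.5)
   lines(x, z, type = 'l', col = 'black', lwd = 2)   # lines adds z curve to plot
}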


Exercise 3.3 
In a city of 50,000 inhabitants, an average of five suicides occurs per year. 
Calculate the probability of ten suicides. 


Answer From the binomial distribution (2.29), we have:

\[
b\!\left(10; 50000, \frac{5}{50000}\right) = \frac{50000!}{10!\;49990!} \left(\frac{5}{50000}\right)^{10} \left(1 - \frac{5}{50000}\right)^{49990} .
\]

Alternatively, from the Poissonian we obtain:

\[
p(10; 5) = \frac{5^{10}\, e^{-5}}{10!} \simeq 0.018 = 1.8\% .
\]
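
Both expressions can be evaluated with the standard R density functions; the following lines are only a quick check of the exercise, with variable names chosen here for illustration.

```r
# Sketch: exact binomial versus Poisson approximation for Exercise 3.3.
p_binom <- dbinom(10, size = 50000, prob = 5 / 50000)   # binomial density (2.29)
p_pois  <- dpois(10, lambda = 5)                         # Poisson density (3.14)
print(c(binomial = p_binom, poisson = p_pois))
```

The two values agree to several decimal places, which is what one expects with n this large and p this small.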




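The Poisson density is normalized, since:

\[
\sum_{x=0}^{\infty} \frac{\mu^x}{x!}\, e^{-\mu} = e^{-\mu} \sum_{x=0}^{\infty} \frac{\mu^x}{x!} = e^{-\mu} e^{\mu} = 1 . \tag{3.15}
\]

The mean and variance of the Poissonian can be found with the same method used
in Eqs. (3.2-3.5) for the binomial distribution. We leave the explicit calculation of
the mean as an exercise, which gives, as expected, the $\mu$ parameter of Eq. (3.14).
For the variance, from Eqs. (2.67, 3.3), one has:

\[
\mathrm{Var}[X] = \sum_{x=0}^{\infty} x^2\, \frac{\mu^x}{x!}\, e^{-\mu} - \mu^2
= \mu \sum_{x'=0}^{\infty} (x'+1)\, \frac{\mu^{x'}}{x'!}\, e^{-\mu} - \mu^2 = \mu(\mu + 1) - \mu^2 = \mu .
\]

Therefore, we obtain the result:

\[
\mu = np , \quad \sigma^2 = \mu , \quad \sigma = \sqrt{\mu} , \tag{3.16}
\]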


which shows that, for a Poissonian, the mean and variance are equal. 


3.4 Normal or Gaussian Density 


Another important limiting case of the binomial density is the normal or Gauss 
density. This approximation is possible when the number of trials is huge: in 
this case only the values of the spectrum in the vicinity of the maximum of the 
density are important. Think, for example, about 1000 coin flips: the values of the 
spectrum for the event head range from 0 to 1000, but the important values will 
be concentrated around the expected value 500, which presumably is the value 
of higher probability. You will hardly ever get values such as 5, 995 and so on. 
Therefore, let us consider cases in which: 


\[
n \gg 1 , \quad 0 < p < 1 , \quad x \gg 1 , \tag{3.17}
\]

and approximate factorials with the Stirling formula, valid for $x \gg 1$:

\[
x! = \sqrt{2\pi x}\; x^x e^{-x} e^{\epsilon_x} \simeq \sqrt{2\pi x}\; x^x e^{-x} , \tag{3.18}
\]

where $|\epsilon_x| < 1/(12x)$. This is already a very good approximation for $x \ge 10$, since
$\exp[-1/120] = 0.992$, corresponding to a relative error of 8 per thousand, as you
can easily verify. Notice also that now $x$ is, as requested, a continuous variable,
which approximates the factorial for integer values.
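
The quality of the Stirling formula is easy to verify numerically. In the sketch below, `stirling` is a helper defined only for this illustration; it implements the right-hand side of Eq. (3.18) without the correction factor.

```r
# Sketch: ratio between the Stirling approximation and the exact factorial.
stirling <- function(x) sqrt(2 * pi * x) * x^x * exp(-x)   # illustrative helper

x     <- c(5, 10, 20, 50)
ratio <- stirling(x) / factorial(x)     # equals exp(-epsilon_x)
bound <- exp(-1 / (12 * x))             # lower bound implied by |epsilon_x| < 1/(12x)

print(data.frame(x = x, ratio = ratio, bound = bound))
```

For x = 10 the ratio is about 0.992, that is the 8 per thousand relative error mentioned above, and it approaches 1 as x grows.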




Making use of the Stirling formula, the binomial density (2.29) assumes the
form:

\[
b(x; n, p) = \frac{n!}{x!\,(n-x)!}\, p^x (1-p)^{n-x}
\simeq \frac{1}{\sqrt{2\pi}}\, \frac{\sqrt{n}\; n^n}{\sqrt{x(n-x)}\; x^x (n-x)^{n-x}}\; p^x (1-p)^{n-x} . \tag{3.19}
\]

Let us now expand the binomial density in the vicinity of its maximum. Turning to
logarithms, we get:

\[
\ln b(x; n, p) = \ln n! - \ln x! - \ln (n-x)! + x \ln p + (n-x) \ln(1-p) .
\]


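The point of maximum is obtained by setting the derivative of $\ln b$ to zero, using
the Stirling formula and then considering $x$ as a continuous variable:

\[
\frac{d}{dx} \ln b = -\frac{d}{dx} \ln x! - \frac{d}{dx} \ln[(n-x)!] + \ln p - \ln(1-p) = 0 . \tag{3.20}
\]

With these approximations, the derivative of $\ln x!$ reads:

\[
\frac{d}{dx} \ln x! \simeq \frac{d}{dx} \ln\left[\sqrt{2\pi x}\; x^x e^{-x}\right] = \frac{d}{dx}\left[\frac{1}{2}\ln 2\pi + \frac{1}{2}\ln x + x \ln x - x\right]
\]
\[
\phantom{\frac{d}{dx} \ln x!} = \frac{1}{2x} + 1 + \ln x - 1 = \frac{1}{2x} + \ln x \;\xrightarrow{\;x \gg 1\;}\; \ln x . \tag{3.21}
\]

Since $d(\ln x!)/dx \simeq \ln x$, Eq. (3.20) becomes:

\[
\frac{d}{dx} \ln b = -\ln x + \ln(n-x) + \ln p - \ln(1-p) = 0 , \tag{3.22}
\]

and hence:

\[
\ln \frac{p\,(n-x)}{x\,(1-p)} = 0 \;\Longrightarrow\; \frac{p\,(n-x)}{x\,(1-p)} = 1 \;\Longrightarrow\; x = \mu = np .
\]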


We have obtained a first result: for $\mu \gg 1$, that is, $x \gg 1$, the mean value also
becomes the maximum. Let us now perform the Taylor expansion up to second
order around this value. By writing $y(x) = \ln b$, we obtain:

\[
y(x) = y(np) + \frac{1}{2}\left[\frac{d^2 y}{dx^2}\right]_{x=np} (x - np)^2 + O\!\left(\frac{1}{n^2}\right)
\simeq y(np) + \frac{1}{2}\left[-\frac{1}{x} - \frac{1}{n-x}\right]_{x=np} (x - np)^2 , \tag{3.23}
\]


where the second derivative is obtained by differentiating Eq. (3.22) and the term in
$(1/n^2)$ by differentiating the second derivative again. After some easy calculations, we get
the equations:

\[
y(x) = y(np) - \frac{1}{2}\, \frac{(x - np)^2}{np(1-p)} ,
\]
\[
b(x; n, p) = e^{y(x)} \simeq e^{y(np)} \exp\left[-\frac{(x - np)^2}{2np(1-p)}\right] . \tag{3.24}
\]

Since $e^{y(np)}$ is simply the binomial density at the maximum $x = np$ calculated with
the Stirling formula, from Eq. (3.19) we easily obtain:

\[
b(x = np; n, p) \simeq \frac{1}{\sqrt{2\pi}}\, \frac{1}{\sqrt{np(1-p)}} , \tag{3.25}
\]

and hence:

\[
b(x; n, p) \simeq g(x; n, p) = \frac{1}{\sqrt{2\pi}}\, \frac{1}{\sqrt{np(1-p)}} \exp\left[-\frac{1}{2}\, \frac{(x - np)^2}{np(1-p)}\right] . \tag{3.26}
\]

This is an approximate form of the binomial that holds for values of $x$ around $\mu = np$,
as long as the terms of order higher than $1/n$ in the Taylor expansion (3.23) are
neglected. The density function $g(x; n, p)$, approximating the binomial for integer
$x$ values, can also be viewed as a continuous function of $x$. Indeed, the factor:

\[
\Delta t = \frac{1}{\sqrt{np(1-p)}}
\]

of Eq. (3.26) is independent of $x$, tends to 0 for $n \to \infty$ and has the meaning of
differential of the variable $t$ given by:

\[
t = \frac{x - np}{\sqrt{np(1-p)}} .
\]

By going to the Riemann limit, it is then possible, for large $n$ values, to calculate
the sum of the probabilities of a set of $X$ values as:

\[
\lim_{n\to\infty} P\{x_1 \le X \le x_2\} = \lim_{n\to\infty} \sum_{x=x_1}^{x_2} b(x; n, p)
= \lim_{n\to\infty} \sum_{x=x_1}^{x_2} \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}\, \frac{(x - np)^2}{np(1-p)}\right] \Delta t
= \frac{1}{\sqrt{2\pi}} \int_{t_1}^{t_2} e^{-t^2/2}\, dt . \tag{3.27}
\]




This fundamental result is known as de Moivre-Laplace theorem. By using $\mu$ and
$\sigma$ from Eqs. (3.6), we then obtain:

\[
\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} \exp\left[-\frac{t^2}{2}\right] dt
= \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{+\infty} \exp\left[-\frac{1}{2}\, \frac{(x - \mu)^2}{\sigma^2}\right] dx = 1 ,
\]

where the result $\int_{-\infty}^{+\infty} \exp(-z^2)\, dz = \sqrt{\pi}$ has been used (see Exercise 3.4).
Therefore, the normal or Gaussian density:

\[
g(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right] \tag{3.28}
\]

is normalized and, when it is integrated, allows the calculation of the probability that
a value of $x$ be in an interval about the mean $\mu$ when the variance is $\sigma^2$. This is by far
the most important p.d.f. of probability theory and should therefore be remembered
by heart.
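
The de Moivre-Laplace limit can be explored numerically by comparing a sum of binomial probabilities with the corresponding Gaussian integral. The sketch below uses the example of 1000 coin flips; the interval [480, 520] and the half-unit continuity correction at the interval ends are choices made for this illustration and are not part of Eq. (3.27).

```r
# Sketch: sum of binomial probabilities versus Gaussian integral.
n <- 1000
p <- 0.5
mu    <- n * p
sigma <- sqrt(n * p * (1 - p))
x1 <- 480
x2 <- 520

p_exact <- sum(dbinom(x1:x2, size = n, prob = p))   # exact binomial sum

# Gaussian integral in the standard variable t; the +-0.5 shift is a
# continuity correction (an extra assumption, see the text above)
p_gauss <- pnorm((x2 + 0.5 - mu) / sigma) - pnorm((x1 - 0.5 - mu) / sigma)

print(c(exact = p_exact, gaussian = p_gauss))
```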


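Exercise 3.4
Prove that

\[
\int_{-\infty}^{+\infty} \exp(-x^2)\, dx = \sqrt{\pi} . \tag{3.29}
\]

Answer By defining

\[
I = \int_{-\infty}^{+\infty} e^{-x^2}\, dx = \int_{-\infty}^{+\infty} e^{-y^2}\, dy ,
\]

we have:

\[
I^2 = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} e^{-(x^2 + y^2)}\, dx\, dy .
\]

By converting to polar coordinates, we get:

\[
I^2 = \int_{0}^{+\infty} \int_{0}^{2\pi} e^{-\rho^2} \rho\, d\rho\, d\phi = 2\pi \int_{0}^{+\infty} e^{-\rho^2} \rho\, d\rho = \left. -\pi\, e^{-\rho^2} \right|_{0}^{+\infty} = \pi ,
\]

and hence $I = \sqrt{\pi}$. Similarly, it can be shown that

\[
\int_{-\infty}^{+\infty} x^2 \exp(-x^2)\, dx = \frac{\sqrt{\pi}}{2} . \tag{3.30}
\]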

The mean and variance of the Gaussian distribution are given by the parameters $\mu$
and $\sigma^2$, which explicitly appear in Eq. (3.28). Indeed, from the integrals (3.29, 3.30),
it is possible to show that:

\[
\langle X \rangle = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{+\infty} x \exp\left[-\frac{1}{2}\, \frac{(x - \mu)^2}{\sigma^2}\right] dx = \mu , \tag{3.31}
\]
\[
\mathrm{Var}[X] = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{+\infty} (x - \mu)^2 \exp\left[-\frac{1}{2}\, \frac{(x - \mu)^2}{\sigma^2}\right] dx = \sigma^2 . \tag{3.32}
\]

Fig. 3.2 Gaussian $g(x; 4, 1.55)$ (full line) and binomial $b(x; 10, 0.40)$ (dotted line) densities for $n = 10$, $p = 0.4$


In statistics, we will have to use the 4-th order moment of the Gaussian; through
integrals of the type (3.29, 3.30), one can demonstrate that:

\[
\Delta_4 = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{+\infty} (x - \mu)^4 \exp\left[-\frac{1}{2}\, \frac{(x - \mu)^2}{\sigma^2}\right] dx = 3\sigma^4 . \tag{3.33}
\]


The Gaussian density approximates the binomial for $\mu \gg 1$. Thanks to the rapid
convergence of the Stirling approximation (3.18), in many cases an error of less than a
few percent is made when $\mu \ge 10$. As a rule of thumb, the Gaussian function can
replace the binomial when the conditions:

\[
\mu = np \ge 10 , \quad n(1-p) \ge 10 \quad \text{(in practice!)} , \tag{3.34}
\]

hold, so that the approximation (3.23) is valid. If $p$ and $1 - p$ are not too close to the
extremes of the interval [0, 1], the condition $\mu \ge 10$ is adequate. Otherwise, both
conditions of Eq. (3.34) must be verified. This property is exemplified in Fig. 3.2,




where the Gaussian approximation of the binomial distribution $b(x; 10, 0.4)$ is
reported. In this case, the mean and variance are given by $\mu = np = 10 \cdot 0.4 = 4$
and $\sigma^2 = np(1-p) = 2.4$. Even with the values $\mu = np = 4$, $n(1-p) = 6 < 10$,
the approximation is already acceptable.


Exercise 3.5 
Calculate the probability to obtain a given number of heads in the 10 coin 
experiment of Table 2.2 using the Gaussian distribution. 


Answer In the ten-coin experiment, the mean, variance and standard deviation
are given by:

\[
\mu = np = 10 \cdot 0.5 = 5 ,
\]
\[
\sigma^2 = np(1-p) = 10 \cdot 0.5\,(1 - 0.5) = 2.50 ,
\]
\[
\sigma = \sqrt{2.5} = 1.58 .
\]

If we insert these values in Eq. (3.28), the requested Gaussian probabilities
are obtained:

\[
g(x) = 0.252\, \exp\left[-\frac{(x - 5)^2}{2 \cdot 2.5}\right] \qquad (x = 0, 1, \ldots, 10) .
\]

If we report the results in a new column of the table, we get the result:

Spectrum            Number                  Binomial      Gaussian
(number of heads)   of trials   Frequency   probability   probability

 0                   0          0.00        0.001         0.002
 1                   0          0.00        0.010         0.010
 2                   5          0.05        0.044         0.042
 3                  13          0.13        0.117         0.113
 4                  12          0.12        0.205         0.206
 5                  25          0.25        0.246         0.252
 6                  24          0.24        0.205         0.206
 7                  14          0.14        0.117         0.113
 8                   6          0.06        0.044         0.042
 9                   1          0.01        0.010         0.010
10                   0          0.00        0.001         0.002

which shows that, even with a value $np = n(1-p) = 5 < 10$, the Gaussian
density gives results already in good agreement with the binomial one. For
this reason, the condition (3.34) is often further extended to:

\[
\mu \ge 5 , \quad n(1-p) \ge 5 .
\]
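
The last two columns of the table can be regenerated with `dbinom` and `dnorm`. The following is a minimal sketch; the column names of the data frame are ours.

```r
# Sketch: binomial and Gaussian probabilities for the ten-coin experiment.
n <- 10
p <- 0.5
x <- 0:n
binomial <- dbinom(x, size = n, prob = p)                   # b(x; 10, 0.5)
gaussian <- dnorm(x, mean = n * p, sd = sqrt(n * p * (1 - p)))   # g(x; 5, 1.58)

print(data.frame(heads = x,
                 binomial = round(binomial, 3),
                 gaussian = round(gaussian, 3)))
```

The largest difference between the two columns is of the order of a few thousandths, at the centre of the spectrum.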


You can also study the Gaussian and all the other distribution functions by plotting 
them with R. For this, you should read Appendix B, where you can find some 
suggestions on the use of the R software. 


3.5 The Three-Sigma Law and the Standard Gaussian 
Density 


As we will see in the next section, the Gaussian density, besides being an excellent
approximation of both the binomial and Poisson distributions, is also the limiting
density of linear combinations of several independent random variables. For these
reasons, it certainly represents the most important distribution of probability theory.
Here we want to discuss a fundamental mathematical property of this function,
known as the three-sigma (or $3\sigma$) law:

\[
P\{|X - \mu| \le k\sigma\} = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{\mu - k\sigma}^{\mu + k\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right] dx
= \begin{cases} 0.683 & \text{for } k = 1 \\ 0.954 & \text{for } k = 2 \\ 0.997 & \text{for } k = 3 \end{cases} \tag{3.35}
\]

which, in words, means: if $X$ is a Gaussian random variable with mean $\langle X \rangle = \mu$
and standard deviation $\sigma[X] = \sigma$, the probability to obtain an $x$ value within an
interval centred on $\mu$ and of width $\pm\sigma$ is about 68%, whereas if the interval width
is $\pm 2\sigma$, it is about 95%. Moreover, the $x$ values can occur outside an interval of
width $\pm 3\sigma$ with a probability of 3 per thousand.
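
The three probability levels of Eq. (3.35) follow directly from the cumulative function of the standard Gaussian, available in R as `pnorm`. A short check, with variable names chosen for this illustration:

```r
# Sketch: Gaussian probability levels for k = 1, 2, 3 standard deviations.
k <- 1:3
levels <- pnorm(k) - pnorm(-k)     # P{|X - mu| <= k*sigma}, independent of mu and sigma
print(data.frame(k = k, probability = round(levels, 3)))
```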

In many practical cases, it is assumed that the spectrum values are all included
in an interval centred on $\mu$ and $\pm 3\sigma$ wide (hence the name of the law (3.35)).
The probabilities defined by Eq. (3.35) are also called Gaussian levels or values
of probability. In practice, the evaluation of the integral (3.35) is difficult because
the primitive $E(x)$ of the Gaussian:

\[
E(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{0}^{x} \exp\left[-\frac{1}{2}\left(\frac{t - \mu}{\sigma}\right)^2\right] dt \tag{3.36}
\]


is not known analytically (strange, but true!). The integral is usually evaluated with 
numerical methods, by series expansion of the exponential and by numerically 
calculating the limit of the integrated series. In this way the probability levels 
of Eq. (3.35) can then be obtained. At this point, it would seem necessary to 
numerically calculate the cumulative probabilities in a given interval, for each 
Gaussian having a given mean and standard deviation. However, this complication 
can be avoided by resorting to a universal or standard Gaussian, which is obtained 
by defining a fundamental variable in statistics, the so-called standard variable: 


\[
T = \frac{X - \mu}{\sigma} , \quad \text{which takes the values} \quad t = \frac{x - \mu}{\sigma} , \tag{3.37}
\]


and measures the deviation of a value x from its mean in units of standard deviation. 
From Eqs. (2.63, 2.65, 2.66) it is immediate to notice that the standard variable has 
zero mean and unit variance. 

The occurrences t of the random variable T are sometimes called deviates. 

If we now insert the variable (3.37) in the integral (3.36), we get: 


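\[
E(x) = \frac{1}{\sqrt{2\pi}} \int_{0}^{x} \exp\left(-\frac{t^2}{2}\right) dt . \tag{3.38}
\]

It is easy to check that this primitive is related to the well-known error function
Erf(x), since:

\[
\mathrm{Erf}(x) = \frac{1}{\sqrt{\pi}} \int_{-x}^{+x} \exp(-t^2)\, dt = \frac{2}{\sqrt{2\pi}} \int_{0}^{\sqrt{2}\,x} \exp\left(-\frac{t^2}{2}\right) dt = 2\, E(\sqrt{2}\,x) . \tag{3.39}
\]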


This universal function can be calculated once and for all, because it is independent
of $\mu$ and $\sigma$. Since the Gaussian is symmetric, the condition:

\[
E(x) = -E(-x) \tag{3.40}
\]

holds. If we now consider the change of variable $t = \sqrt{2}\, z$ and integrate the
exponential series, we obtain:

\[
E(x) = \frac{1}{\sqrt{\pi}} \int_{0}^{x/\sqrt{2}} e^{-z^2}\, dz = \frac{1}{\sqrt{\pi}} \sum_{k=0}^{\infty} \frac{(-1)^k\, (x/\sqrt{2})^{2k+1}}{k!\,(2k+1)} . \tag{3.41}
\]

The sum of this series is tabulated in all statistical books (in this one, it can be found
in Table E.1 of Appendix E). The primitive of Eq. (3.38) corresponds to a standard
Gaussian density:

\[
g(x; 0, 1) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \tag{3.42}
\]

with zero mean and unit standard deviation. A distribution having Gaussian density
(3.28) is called normal and often denoted by $N(\mu, \sigma^2)$; the corresponding random
variable sometimes is denoted by $X \sim N(\mu, \sigma^2)$, or $T \sim N(0, 1)$ when it is standard.
The symbol $\sim$ means "distributed as". In R this function is called dnorm(x).


The cumulative function of the standard Gaussian of Fig. 3.3 is usually indicated
as $\Phi(x)$:

\[
\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt . \tag{3.43}
\]

The link with the function $E(x)$ of Eq. (3.38) is given by:

\[
\Phi(x) = 0.5 + E(x) , \tag{3.44}
\]

which is valid also for $x < 0$, thanks to Eq. (3.40). $\Phi(x)$ is present in R with
the function pnorm(x), which can be used as an alternative to Table E.1 using
the call pnorm(x)-0.5, which gives a result accurate up to seven significant
digits. In summary, as shown by the following examples, the calculation of Gaussian
probabilities can always be traced back to the case of the standard Gaussian of zero
mean and unit variance, for which there is a universal table.
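
As an illustration of how the tabulated values arise, the sketch below sums the series obtained by integrating the exponential term by term and compares it with pnorm(x)-0.5. The function name `E_series` and the truncation at 60 terms are assumptions made for this example; the truncation is adequate only for moderate values of x, such as those used here.

```r
# Sketch: primitive E(x) of the standard Gaussian from the integrated series.
E_series <- function(x, kmax = 60) {          # illustrative helper, truncated series
  k <- 0:kmax
  z <- x / sqrt(2)                            # change of variable t = sqrt(2) z
  sum((-1)^k * z^(2 * k + 1) / (factorial(k) * (2 * k + 1))) / sqrt(pi)
}

x     <- c(0.5, 1, 1.96, 2.33)
E_sum <- sapply(x, E_series)                  # series value
E_R   <- pnorm(x) - 0.5                       # value from R, Eq. (3.44)

print(data.frame(x = x, series = E_sum, pnorm = E_R))
```

For x = 1 both columns give 0.3413, the entry of Table E.1 used in Exercise 3.6.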




Fig. 3.3 Standard Gaussian $g(x; 0, 1)$ (3.42) and corresponding cumulative function $\Phi(x)$ (3.43)


Exercise 3.6
Find the probability $P\{\mu - \sigma \le X \le \mu + \sigma\}$ with 4 significant digits.


Answer In the two extremes of the interval to be considered, the standard
variable (3.37) is equal to $\pm 1$; from Table E.1, given in Appendix E, for $t = 1.00$
(1.0 is read in the row and the last zero in the column heading) we have
$P = 0.3413$. Since the table gives the integral between zero and $t$, we have to
double this value. We get therefore $P\{\mu - \sigma \le X \le \mu + \sigma\} = 2 \times 0.3413 =
0.6826$, according to Eq. (3.35). Note that the table only lists positive values of $t$,
with probabilities between 0 and 0.5, since the standard Gaussian is symmetric about the origin.


Exercise 3.7 
Find the probability to obtain values X > 12 or —2 < X < 12 froma 
Gaussian N(5, 9), that is, with mean jz = 5 and standard deviation o = 3. 


Answer In this case the standard variable (3.37) is: 


(Paes) =2=9 


(= SS 4 23g CS S =2,33) . 
3 


(continued) 


3.6 Central Limit Theorem and Universality of the Gaussian Curve 91 


Exercise 3.7 (continued) 
From the cumulative $\Phi(x)$ of Fig. 3.3, we see that the probability of occurrences
$x > 2.33$ is very small. We can obtain a precise value from Table E.1,
where, for $x = 2.33$, we find the number 0.4901, which is the probability
to obtain normal deviates between 0 and 2.33. Therefore, the requested
probabilities are given by:


$$P\{X > 12\} = 0.5000 - 0.4901 = 0.0099 \simeq 1\% ,$$
$$P\{-2 \le X \le 12\} = 1 - 2 \times 0.0099 = 0.9802 \simeq 98\% .$$
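As a cross-check of Exercise 3.7, the same probabilities can be obtained in R without standardizing by hand, since pnorm accepts the mean and the standard deviation as arguments. This is only a sketch; the variable names are illustrative.

```r
# Sketch for Exercise 3.7: Gaussian with mean 5 and standard deviation 3.
mu <- 5
sigma <- 3
t_up  <- (12 - mu) / sigma       # standard variable at x = 12
t_low <- (-2 - mu) / sigma       # standard variable at x = -2
p_tail   <- 1 - pnorm(12, mean = mu, sd = sigma)                        # P{X > 12}
p_inside <- pnorm(12, mean = mu, sd = sigma) - pnorm(-2, mean = mu, sd = sigma)
print(c(t_low = t_low, t_up = t_up))
print(c(p_tail = p_tail, p_inside = p_inside))
```

The results differ slightly from the table values because the table is entered at t = 2.33, whereas R uses the unrounded value 7/3.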


Exercise 3.8 
Find the extremes of the Gaussian probability interval of 95% centred around 
the mean. 


Answer We have to evaluate a real number $t$ such that $P\{\mu - t\sigma \le X \le \mu + t\sigma\} = 0.95$.
As an alternative to Table E.1, we use, in this case, the R routine
qnorm, which gives the Gaussian quantile values. To obtain the solution,
we must enter the command qnorm(0.5+0.5*0.95), which returns the
value 1.959964. In conclusion, the answer is: a 95% probability is obtained by
integration of the Gaussian in an interval of half-width $1.96\,\sigma$ centred around the
mean. Indeed:

$$t = \frac{x - \mu}{\sigma} = \pm 1.96 \quad \text{implies} \quad x = \mu \pm 1.96\,\sigma$$

and hence:

$$\frac{1}{\sqrt{2\pi}\,\sigma} \int_{\mu - 1.96\sigma}^{\mu + 1.96\sigma} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} dx = 0.95 .$$
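The result of Exercise 3.8 can be verified numerically. The sketch below assumes, purely for illustration, a Gaussian with mean 5 and standard deviation 3 (any other values would do) and integrates its density over the interval found with qnorm.

```r
# Sketch for Exercise 3.8: 95% probability interval centred around the mean.
t95 <- qnorm(0.5 + 0.5 * 0.95)   # quantile of the standard Gaussian
mu <- 5                          # hypothetical mean
sigma <- 3                       # hypothetical standard deviation
# numerical integration of the Gaussian density over mu -/+ t95 * sigma
area <- integrate(dnorm, lower = mu - t95 * sigma, upper = mu + t95 * sigma,
                  mean = mu, sd = sigma)$value
print(t95)
print(area)
```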


3.6 Central Limit Theorem and Universality of the Gaussian 
Curve 


The Gaussian density appears, so far, to be only a limiting distribution of the
binomial density for a high number of successes. However, a fundamental theorem of
probability theory, proposed by Gauss and Laplace in 1809-1812, and demonstrated
in a general way in the early 1900s, assigns a crucial role to it, both for continuous
and discrete variables, as follows.


Theorem 3.1 (Central Limit) Consider a random variable $Y_N$ which is a linear
combination of $N$ random variables $X_i$:

$$Y_N = \sum_{i=1}^{N} a_i X_i , \tag{3.45}$$

where $a_i$ are constant coefficients. If:

(a) $X_i$ are mutually independent (see Eq. (2.9));
(b) $X_i$ have finite variance;
(c) all variances (or standard deviations) have the same order of magnitude:

$$\frac{\sigma[X_i]}{\sigma[X_j]} = O(1) \quad \text{for all } i, j ; \tag{3.46}$$

then, for $N \to \infty$, the random variable $Y_N$ converges in law, according to
Eq. (2.75), towards the Gaussian distribution.
Therefore, we can write, using cumulative functions:

$$P\left\{ \frac{Y_N - \langle Y_N \rangle}{\sigma[Y_N]} \le x \right\} \xrightarrow[N \to \infty]{} \Phi(x) ,$$

where $\Phi(x)$ is the cumulative function of the standard Gaussian (3.43).


Proof The proof of the theorem, for variables having the same distribution, is based 
on generating functions and is reported in Appendix C. Also the de Moivre-Laplace 
formula (3.27) could be seen as a special case of the theorem for the sum of 
Bernoulli variables. 

More generally, it can also be shown that this theorem holds for sums of variables, 
both discrete and continuous, each having a different distribution [Cra51]. However,
it is essential for the variables $X_i$ to be mutually independent (condition (a)) and,
if taken individually, to have a weak influence on the final result (conditions (b)
and (c)). Moreover, a very important and rather astonishing fact, which occurs in
practice, should be immediately noted: the condition $N \to \infty$ can be replaced, in
most cases, with a good approximation by the condition $N > 10$.


In short, the theorem states that a random variable tends to have Gauss density 
if it is the linear superposition of several independent variables which, if taken 
individually, have weak influence on the final result. 

Speaking somewhat freely, we can say that the theorem assigns to the Gaussian 
the role of a universal density function for variables of systems in “high statistical 
equilibrium”. 




Even more interesting is the fact that practice and computer simulations show 
that N does not have to be very big. As already highlighted in the proof of the 
theorem, it is often legitimate to use the condition N > 10. 

It is often quoted: “experimentalists use the Gaussian because they think that
mathematicians have proved that it is the universal curve; mathematicians use it
because they think that experimentalists have demonstrated its universality in practice
...”. This somewhat simplistic statement should be replaced with the following one:
random variables often follow the Gaussian distribution because the conditions of
the central limit theorem are often quite well verified. However, there are several
important exceptions to this general rule, so that the theorem should be considered
as a fundamental reference point, not to be blindly applied.

Here are some examples of random variables which, according to the theorem, 
are Gaussian distributed: 


• The height or weight of a population of ethnically homogeneous individuals.

• The weight of the beans contained in a standard can.

• The values of the intelligence quotient (IQ) of a group of people.

• The mean of a sample with a number of events greater than ten: in this case
conditions (a) to (c) of the theorem are certainly satisfied, because these variables
are independent and have the same variance.

• The velocity components of the molecules of an ideal gas.


On the contrary, the following variables are not Gaussian: 


• The energy of the molecules of an ideal gas: this quantity is proportional to the
square modulus of the velocity (which is a vector with Gaussian components)
and therefore the linearity condition (3.45) is no longer valid. We will soon
show that the square modulus of a vector of independent Gaussian components
follows a particular distribution, called $\chi^2$ or chi-square. In three dimensions,
this distribution is the famous Maxwell distribution, well known to physicists.

• As we will see shortly, the arrival times of Poissonian events do not follow the
Gaussian distribution.


The central limit theorem can be verified with simulated data with a few lines of R 
code: 


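N = 10
y = numeric(1000)                        # allocate the output vector before filling it
for(j in 1:1000) y[j] = sum(runif(N))    # each entry is the sum of N uniform U(0,1) values
hist(y)
plot(density(y, adjust=0.01))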


where a vector of 1000 values constructed as the sum of N uniform variables $U(0, 1)$
is generated. The raw data are then histogrammed with hist and also displayed in
a different way with the density routine, which is described in Appendix B. By
varying N from 1 to larger and larger values, you can check the speed of convergence
of y towards the Gaussian distribution.
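The speed of convergence can also be judged with a number rather than by eye. The sketch below, which is only one possible check and not the authors' procedure, standardizes the sum of N uniform variables (each has mean 1/2 and variance 1/12) and counts the fraction of values within one standard deviation, to be compared with the Gaussian value of about 0.683. The seed and the chosen values of N are arbitrary.

```r
# Sketch: fraction of standardized sums of N uniform variables within one sigma.
set.seed(1)                       # arbitrary seed, only for reproducibility
n_rep <- 1000
N_values <- c(1, 2, 10)
frac <- sapply(N_values, function(N) {
  y <- replicate(n_rep, sum(runif(N)))   # n_rep sums of N values U(0,1)
  z <- (y - N / 2) / sqrt(N / 12)        # standardized sums
  mean(abs(z) <= 1)                      # fraction within one standard deviation
})
names(frac) <- paste0("N=", N_values)
print(frac)
print(pnorm(1) - pnorm(-1))       # Gaussian reference value
```

For N = 1 the expected fraction is 1/sqrt(3), about 0.58, which is clearly non-Gaussian; for N = 10 the simulated fraction is already close to the Gaussian one.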


3.7 Poisson Stochastic Processes 


In stochastic processes, random variables depend on continuous or discrete parameters.
In this section, we will deal with a very frequent case, when time $t$ is the
continuous parameter.

Let’s then consider a stochastic process in which a source generates events: 


(a) Of discrete type, in such a way that at most one event can be emitted in an
infinitesimal time interval $dt$.

(b) With constant probability $\lambda$ per unit of time, equal for all the events. This
parameter, of dimension $t^{-1}$, represents the emission frequency per unit of time
(e.g. the number of events per second). This property implies that the average 
number of events emitted in a given time interval depends only on the width of 
the interval, not on its position along the time axis. 

(c) Mutually independent. In other words, the numbers of events occurring in 
disjoint time intervals are independent random variables. 


A stochastic process following conditions (a) to (c) is called stationary Poisson 
process. 

What is the statistical law of the number of events generated within a measurable
time interval $\Delta t$? The answer becomes easy when the following quantities are
defined:

• the number of attempts $N = \Delta t / dt$. If $\Delta t$ is a measurable macroscopic time
interval and $dt$ is a differential, we always have $N \gg 1$. Note that the duration
of a useful attempt to generate an event is assumed to be a very small quantity,
that is, the differential $dt$. This is the crucial hypothesis of all the arguments we
are developing;

• the probability to emit a single event in an attempt of duration $dt$ is $p = \lambda\, dt$.
Then, we always have $p \ll 1$, if the emission process is discrete and $dt$ is a
differential;

• the average number of events generated within $\Delta t$ is $\mu = \lambda \Delta t$.


The event emission process can then be considered as the binomial probability of
having $\{X = n\}$ hits in $N = \Delta t / dt \gg 1$ trials when the elementary probability of
a success is $p = \lambda\, dt \ll 1$. Keeping in mind the results of Sect. 3.3, it is immediate
to infer that, in a discrete process, the event counting $X(\Delta t)$ in an interval $\Delta t$ is a
Poissonian random variable, with $\mu = \lambda \Delta t$. From Eq. (3.14) we obtain:

$$P\{X(\Delta t) = n\} = p_n(\Delta t) = \frac{(\lambda \Delta t)^n}{n!}\, e^{-\lambda \Delta t} . \tag{3.47}$$

It is easy to show that, if Eq. (3.47) holds, counts in disjoint time intervals are 
independent random variables (see Problem 3.13). 
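The binomial argument leading to Eq. (3.47) can be illustrated numerically: splitting the interval into N attempts, each with success probability equal to the mean count divided by N, the binomial probabilities approach the Poisson ones as N grows. In this sketch the rate, the interval width and the values of N are hypothetical choices.

```r
# Sketch: binomial with N attempts and p = mu/N versus the Poisson law.
lambda  <- 2            # hypothetical emission rate (events per unit time)
delta_t <- 1.5          # hypothetical width of the time interval
mu <- lambda * delta_t  # mean number of events in the interval
n <- 0:10               # counts at which the two laws are compared
N_values <- c(10, 100, 10000)
max_diff <- sapply(N_values, function(N) {
  max(abs(dbinom(n, size = N, prob = mu / N) - dpois(n, mu)))
})
print(data.frame(N = N_values, max_diff = max_diff))
```

The largest discrepancy shrinks as the number of attempts increases, in line with the limit N much larger than 1 and p much smaller than 1 used in the text.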

Poisson’s law is followed by a truly vast class of phenomena: the number of
fish caught in 1 h in stable environmental conditions, the number of cars that stop
at a traffic light on the same day and at the same time over, let’s say, 1 month (if there
are no exceptional events), the number of photons emitted by excited atoms, the
number of nuclear particles emitted by a radioactive source, the number of shooting
stars observed in 10 minutes on an August night, the number of traffic accidents in
1 year if no new safety measures are introduced, and so on; in short, all those cases
in which a stable source ($\lambda$ constant) emits discrete ($\lambda\, dt \ll 1$) and independent
events.

So far we have considered the properties of the source of events; what is the role 
of the observer or of the counting apparatus? The observation of the process does 
not alter the Poissonian statistics if the detection or counting apparatus (the eye, an 
instrument or a more complicated device) has both a short dead time and a short 
resolution time, compared to the average arrival time of the events. The dead time is 
the time during which the instrument remains inactive after the detection of an event 
(e.g. in a Geiger counter, this time is of the order of a millisecond); the resolution 
time is the time below which the apparatus no longer records all emitted events (this
quantity is slightly less than 1 second for the human eye, whereas it ranges from
a fraction to a thousandth of a second in mechanical instruments, and it can reach 
a billionth of a second or even less in electronic instruments). Basically, an event 
arriving within the dead time is not recorded, while two or more events arriving 
within the resolution time are recorded as a single event. It is clear then that the 
original Poissonian statistics is not altered if the probability of an event arriving 
within the dead time and the probability of arrival of more than one event within the 
resolution time are both negligible. 

The number of events emitted and/or counted in a finite time $\Delta t$ is not the only
important random variable of a stochastic process; the time $T$ between events is also
a random variable. This leads us to discuss a further fundamental statistical law of
nature: the negative exponential law of arrival times. Imagine resetting the clock at an
arbitrary time $t = 0$ or on the last recorded event: the time $T$ of the next arrival is a
random variable which is determined by the compound probability to have no events
until $t$ and to have the occurrence of an event within $(t, t + dt]$. Setting $n = 0$
in Eq. (3.47), we have $p_0(t) = e^{-\lambda t}$, while, according to the definition of $\lambda$, the
probability of arrival in $(t, t + dt]$ is simply given by $\lambda\, dt$. The p.d.f. of the random
variable $T$ is therefore a negative exponential:

$$e(t)\, dt = p_0(t)\, \lambda\, dt = \lambda e^{-\lambda t}\, dt , \quad t \ge 0 . \tag{3.48}$$

It is straightforward to prove that this distribution is normalized and that its mean
and variance are given by:

$$\mu = \int_0^\infty t\, e(t)\, dt = \frac{1}{\lambda} , \qquad \sigma^2 = \int_0^\infty \left( t - \frac{1}{\lambda} \right)^2 e(t)\, dt = \frac{1}{\lambda^2} . \tag{3.49}$$




The discrete equivalent of the exponential density is the geometric density (3.7). It
is indeed easy to see that, for $p \ll 1$, the logarithm of this equation becomes:

$$\ln g(n) = \ln[p\,(1 - p)^{n-1}] = \ln p + (n - 1) \ln(1 - p) \to \ln p - (n - 1)\, p ,$$

and hence:

$$g(n) \to p\, e^{-(n-1) p} . \tag{3.50}$$


The results obtained so far suggest some interesting considerations. The intuitive
representation that we often have of the arrival times of an event flow (e.g. 10 
events per second) is of Gaussian type: events should arrive at intervals of about 
1/10 of a second, with small symmetrical fluctuations around this value. This 
intuitive representation is completely wrong: the arrival of events is governed by 
the exponential law, according to which clusters of events are much more likely than 
events separated by long waiting intervals (see also Fig. 3.4). This effect is clearly 
visible with the particle counters that are often used in a laboratory: events seem to 
arrive “in clusters” separated by long waiting intervals, giving, to non-experts, the 
impression of instrumental malfunctions. Instead, this apparent temporal correlation 
is precisely due to the absolute lack of correlation between events! 

Rare events, such as accidents or natural disasters, often occur within short time 
periods, generating the (false) belief of mysterious correlations between disasters. 
In fact, very often these are just effects due to chance, which have suggested some 
popular sayings such as “good things come in threes, bad things come in threes ...”
or “it never rains but it pours ...”. Maybe these are among the few proverbs that 
have some scientific foundation ...

Fig. 3.4 In a Poisson process, the density of arrival times is a negative exponential, as shown by the dashed curve, which refers to an average flow of 0.1 event/s ($\lambda = 0.1\ \mathrm{s}^{-1}$, i.e. 1 event every 10 s). The solid line is a Gaussian with a mean of 10 s and arbitrary standard deviation of 4 s, which is an example of the intuitive (but incorrect!) idea that we often have of stochastic phenomena

Since the exponential density does not have a Gaussian-like “bell shape”, the mean
does not give much information about the data localization in this case, and it is
more useful to use the cumulative function of the
density (3.48), which is easily obtained from Eq. (2.33):


$$P\{0 \le T \le t\} \equiv F(t) = \int_0^t \lambda e^{-\lambda \tau}\, d\tau = 1 - e^{-\lambda t} . \tag{3.51}$$


This function gives the arrival probability of at least one event in $[0, t]$. The
probability of not observing events up to a time $t$ is given by:

$$P\{T > t\} = 1 - F(t) = 1 - P\{0 \le T \le t\} = e^{-\lambda t} = p_0(t) , \tag{3.52}$$

which is the Poisson density (3.47) when $n = 0$.
Since, from Eq. (3.49), we know that the time mean is $1/\lambda$, we can calculate the
percentage of arrivals before and after this value:

$$P\{0 \le T \le 1/\lambda\} = 1 - e^{-1} = 0.63 , \tag{3.53}$$
$$P\{1/\lambda < T\} = e^{-1} = 0.37 . \tag{3.54}$$
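The asymmetry between arrivals before and after the mean time can be reproduced with the exponential functions of R. The sketch below uses the rate of 0.1 events per second quoted for Fig. 3.4; the seed and the sample size are arbitrary.

```r
# Sketch: fraction of exponential arrival times shorter than the mean 1/lambda.
lambda <- 0.1                                   # rate in events per second
p_before <- pexp(1 / lambda, rate = lambda)     # exact value 1 - exp(-1)
set.seed(2)                                     # arbitrary seed
t_sim <- rexp(10000, rate = lambda)             # simulated arrival times
frac_before <- mean(t_sim <= 1 / lambda)        # simulated fraction before the mean
print(c(exact = p_before, simulated = frac_before))
print(c(mean = mean(t_sim), variance = var(t_sim)))   # compare with 1/lambda and 1/lambda^2
```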


Exercise 3.9 

On average, 40 vehicles passed through a certain stretch of road between 
23:00 and 23:30. What is the probability of observing a time interval less 
than 10 s and one longer than 5 minutes between one vehicle and the next 
one? 


Answer The mean time between two consecutive vehicles is given by:

$$\frac{60 \times 30}{40} = \frac{1800}{40} = 45\ \mathrm{s} ,$$

from which a mean arrival frequency of

$$\lambda = \frac{1}{45} = 0.022\ \mathrm{s}^{-1}$$

is obtained. By applying Eq. (3.51) with $t = 10$ s, one obtains:

$$P\{0 \le T \le 10\ \mathrm{s}\} = 1 - e^{-0.022 \cdot 10} = 0.199 \simeq 20\% .$$

Then, the probability to observe time intervals longer than $t = 5$ min $= 300$ s
is given by:

$$P\{T > 300\ \mathrm{s}\} = 1 - P\{0 \le T \le t\} = 1 - (1 - e^{-\lambda t}) = e^{-\lambda t} = e^{-0.022 \cdot 300} = 1.27 \cdot 10^{-3} \simeq 0.13\% .$$
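The numbers of Exercise 3.9 can be reproduced with pexp. In this sketch the rate is kept as the unrounded fraction 40/1800 per second, which is what yields the quoted values 0.199 and 1.27e-3; the variable names are illustrative.

```r
# Sketch for Exercise 3.9: 40 vehicles in 30 minutes.
rate <- 40 / (30 * 60)                            # mean arrival frequency in 1/s
p_short <- pexp(10, rate = rate)                  # P{T <= 10 s}
p_long  <- pexp(300, rate = rate, lower.tail = FALSE)   # P{T > 300 s}
print(c(mean_time = 1 / rate, p_short = p_short, p_long = p_long))
```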




We now demonstrate that the negative exponential is the only functional form
ensuring the temporal independence of events. Suppose that $A = \{T > t + \Delta t\}$
and $B = \{T > t\}$ are the events “there are no arrivals up to $t + \Delta t$” and “up to $t$”,
respectively. Obviously one has $\{T > t + \Delta t\} \subset \{T > t\}$, because if $A$ is verified,
the same holds for $B$, and $P(A \cap B) = P(A)$ (the intersection of the events does
not correspond to that of the time intervals!). The conditional probability (1.19) is
in this case:

$$P\{T > t + \Delta t \mid T > t\} = \frac{P\{\{T > t + \Delta t\} \cap \{T > t\}\}}{P\{T > t\}} = \frac{P\{T > t + \Delta t\}}{P\{T > t\}} = \frac{e^{-\lambda (t + \Delta t)}}{e^{-\lambda t}} = e^{-\lambda \Delta t} = P\{T > \Delta t\} . \tag{3.55}$$

This result shows that the information on the absence of events up to $t$ has
no influence on the arrival probability in subsequent times. The exponential law
remains unchanged, independently of the arbitrary moment $t$ at which the clock
is reset; it therefore describes systems without memory.¹ The independence among
counts in separate time intervals also implies the independence of the events “there is
an arrival at time $t$” and “there are no arrivals in $(t, t + \Delta t)$”. Therefore, the relation:

$$P\{T > t + \Delta t \mid T = t\} = P\{T > \Delta t\} = e^{-\lambda \Delta t}$$

holds, which shows that the times between arrivals (interarrival times) are independent
random variables of negative exponential p.d.f.
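The absence of memory expressed by Eq. (3.55) can be checked numerically: the conditional survival probability over a further interval does not depend on how long one has already waited. The rate, the interval and the waiting times used below are hypothetical.

```r
# Sketch: memoryless property of the exponential law.
lambda  <- 0.5                      # hypothetical rate
delta_t <- 2                        # further waiting interval
t0 <- c(0, 1, 5, 20)                # times already elapsed without events
surv <- function(t) pexp(t, rate = lambda, lower.tail = FALSE)   # P{T > t}
cond <- surv(t0 + delta_t) / surv(t0)   # P{T > t0 + delta_t | T > t0}
print(cond)                             # the same value for every t0
print(exp(-lambda * delta_t))           # P{T > delta_t}
```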

To describe biological systems or devices with memory, which tend to “age”, a
variant of the exponential function, called Weibull density, is sometimes used:

$$w(t) = a\, t^{b-1} e^{-(a/b)\, t^b} , \quad t \ge 0 , \quad a, b > 0 , \tag{3.56}$$

where the parameters $a$ and $b$ are often empirically determined from data analysis.
The Weibull curve, for $a = b = 1$, is the negative exponential; instead it tends
to assume a bell shape for $b > 1$. The corresponding cumulative function is
$F(t) = \int_0^t w(\tau)\, d\tau = 1 - \exp(-a t^b / b)$. If, for example, $a = 2$, $b = 2$, then $w(t) =
2t \exp(-t^2)$ and from Eq. (3.55) one has $P\{T > t + \Delta t \mid T > t\} = \exp[-(\Delta t)^2 - 2t\, \Delta t]$.
The probability of having no events up to $t + \Delta t$ (i.e. the survival probability, if the
event is the system breakdown) decreases when $t$ increases.
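The ageing effect of the Weibull density can be made concrete with the example a = 2, b = 2. The sketch below writes the density as a t^(b-1) exp(-a t^b / b), consistent with the cumulative function quoted in the text. The mapping to R's dweibull (shape = b, scale = (b/a)^(1/b)) is an assumption derived here by comparing the two forms; it is checked numerically rather than taken from the text.

```r
# Sketch: Weibull density with a = 2, b = 2 and its conditional survival probability.
a <- 2
b <- 2
w <- function(t) a * t^(b - 1) * exp(-a * t^b / b)     # density in the (a, b) form
shape <- b                       # assumed mapping to R's parameterization
scale <- (b / a)^(1 / b)
t <- c(0.2, 0.5, 1, 2)
print(max(abs(w(t) - dweibull(t, shape = shape, scale = scale))))   # agreement of the two forms
surv <- function(t) pweibull(t, shape = shape, scale = scale, lower.tail = FALSE)
delta_t <- 0.5
cond <- surv(t + delta_t) / surv(t)          # P{T > t + delta_t | T > t}
print(cond)                                  # decreases as t increases
print(exp(-delta_t^2 - 2 * t * delta_t))     # closed form for a = b = 2
```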

We now generalize the exponential law to non-contiguous Poissonian events and
determine the probability distribution of the waiting time until the arrival of the
$k$-th event after the clock start (see Fig. 3.5).


¹ If we combine several systems without memory, we can have systems with memory, whose
time probability densities do not satisfy Eq. (3.55); see, for instance, in the next two chapters,
Exercise 4.1 and Problem 5.8.


Fig. 3.5 Definition of the k-th event of the gamma p.d.f., for the arrival times of non-contiguous events

Fig. 3.6 Gamma density for $\lambda = 1$ and different $k$ values


After inserting $p_{k-1}(t)$ in Eq. (3.48) instead of $p_0(t)$, we easily get the Erlangian
density of order $k$ or gamma density of Fig. 3.6:

$$e_k(t) = \lambda\, \frac{(\lambda t)^{k-1}}{(k-1)!}\, e^{-\lambda t} = \frac{\lambda^k}{\Gamma(k)}\, t^{k-1} e^{-\lambda t} , \quad t \ge 0 , \quad k > 0 , \tag{3.57}$$

where $\Gamma(k) = (k - 1)!$ for integer $k$ values (see, further on, Eq. (3.65)). The mean
and variance of this density are given by:

$$\mu = \frac{k}{\lambda} , \qquad \sigma^2 = \frac{k}{\lambda^2} . \tag{3.58}$$


For contiguous events, $k - 1 = 0$ and we again obtain Eqs. (3.48-3.49). When $k$
increases, the mean and variance increase and the flux density $\lambda_k = 1/\mu$ decreases.
The gamma density can also be considered as the distribution of the sum of $k$
independent negative exponential random variables. This property is evident from
Fig. 3.6 where, thanks to the central limit Theorem 3.1, we see that the curve, for
large $k$, tends to the Gaussian form. These results can also be derived from the
theory of functions of random variables, which we will develop in Chap. 5 (see
Problem 5.5).

In R, the Erlangian family is given by the function dgamma(t, shape,
rate), where shape $= k$ and rate $= \lambda$; if $k = 1$ one gets the negative
exponential. As usual, by changing the prefix d to p, q or r, one obtains the
cumulative, quantile and simulated values of the distributions, respectively.
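A minimal sketch of how dgamma relates to the Erlangian form with the factorial, and of the interpretation of the gamma variable as a sum of k exponential times, is given below. The values of the rate, of k and of the seed are arbitrary illustrative choices.

```r
# Sketch: Erlangian density of order k versus dgamma, and sum of k exponential times.
lambda <- 1
k <- 3
erlang <- function(t) lambda * (lambda * t)^(k - 1) / factorial(k - 1) * exp(-lambda * t)
t <- c(0.5, 1, 2, 5)
print(max(abs(erlang(t) - dgamma(t, shape = k, rate = lambda))))   # agreement of the two forms
set.seed(3)                                          # arbitrary seed
s <- replicate(5000, sum(rexp(k, rate = lambda)))    # waiting time of the k-th event
print(c(mean = mean(s), variance = var(s)))          # compare with k/lambda and k/lambda^2
```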

It is also easy to demonstrate that, between the Poisson and Erlang densities, the
useful relation:

$$\frac{dp_k(t)}{dt} = e_k(t) - e_{k+1}(t) \tag{3.59}$$

holds.

Up to now, we have demonstrated that conditions (a) to (c), introduced at the
beginning of the section, which define the stationary Poisson process, imply
Eq. (3.47) and, from it, the negative exponential density of arrival times. We will
now show that the inverse is also true: Eq. (3.47) holds if interarrival times are
independent and follow the negative exponential law. In fact, from the series
expansion of Eqs. (3.51, 3.52), it follows that the probability to observe an event
in an infinitesimal interval is $\lambda\, dt$ and that to observe more events is negligible (an
infinitesimal of higher order). Moreover, if $k$ events occur within $t$, it means that $t$
lies between $T_k$ and $T_{k+1}$, the arrival times of the $k$-th and $(k + 1)$-th events, respectively.
Since $P\{T_k \le t\} = P\{T_k \le t < T_{k+1}\} + P\{T_{k+1} \le t\}$, we can write:

$$P\{X(t) = k\} = P\{T_k \le t < T_{k+1}\} = P\{T_k \le t\} - P\{T_{k+1} \le t\} = \int_0^t [e_k(\tau) - e_{k+1}(\tau)]\, d\tau = p_k(t) ,$$

where, in the last step, Eq. (3.59) has been used. In this proof we have used the
property that the Erlang distribution $e_k(t)$ can also be obtained as the distribution
of the sum of $k$ exponentially distributed random times, without explicitly invoking
the Poisson distribution (see Problem 5.5 and Appendix C).
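The relation between counts and arrival times used in this proof can be verified numerically: the difference between the cumulative functions of the k-th and (k+1)-th arrival times equals the Poisson probability of k counts. The rate and the time used in this sketch are hypothetical.

```r
# Sketch: P{X(t) = k} = P{T_k <= t} - P{T_(k+1) <= t} for a Poisson process.
lambda <- 2                      # hypothetical rate
t <- 1.5                         # hypothetical observation time
k <- 1:8
p_from_times <- pgamma(t, shape = k, rate = lambda) -
                pgamma(t, shape = k + 1, rate = lambda)
p_poisson <- dpois(k, lambda * t)
print(cbind(k, p_from_times, p_poisson))
```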




Having thus rediscovered Poisson’s law, which implies the independence of 
counts in disjoint time intervals, ultimately we arrive at the following: 


Theorem 3.2 (Stochastic independence) A necessary and sufficient condition for 
a stationary Poisson process is to have independent interarrival times and a negative 
exponential p.d.f. 


3.8 $\chi^2$ Density


What is the p.d.f. of the square modulus of a vector with random normal components?
This problem requires the determination of the p.d.f. of a random variable
which is a function of other random variables, a topic that will be analysed in
detail in Chap. 5. However, here we anticipate a specific result, connected to the $\chi^2$
density, which plays an important role within the one-dimensional density functions,
which we are describing here.

We then want to determine the p.d.f. of the variable:

$$Q = \sum_{i=1}^{n} X_i^2 , \tag{3.60}$$


which is the sum of squares of n mutually independent standard normal variables. 
We immediately note that this discussion is valid, in general, for any independent 
Gaussian variables, because a normal variable can always be standardized through 
the transformation (3.37). 

We indicate with $F_Q(q)$ the cumulative of the p.d.f. we are searching for: it gives
the probability $P\{Q \le q\}$ that the variable $Q$ is within a hypersphere of radius
$\sqrt{q}$. Since the variables $X_i$ come from the standard Gaussian density (3.42) and are
independent, we deduce, from Eqs. (1.23, 1.24, 2.9), that $P\{Q \le q\}$ will be given by
the compound probability to obtain a set of values $(x_1, x_2, \ldots, x_n)$, that is, the
product of the individual probabilities,
summed (or integrated) over the set obeying the rule $\sum_i x_i^2 \le q$. Therefore, we
have:

$$P\left\{ \sum_i X_i^2 \le q \right\} \equiv F_Q(q) = \int \cdots \int_{\sum_i x_i^2 \le q} \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\, e^{-x_i^2/2}\; dx_1\, dx_2 \ldots dx_n = \int \cdots \int_{\sum_i x_i^2 \le q} \left( \frac{1}{\sqrt{2\pi}} \right)^n e^{-\sum_i x_i^2/2}\; dx_1\, dx_2 \ldots dx_n , \tag{3.61}$$


where the product in the integrand follows from the compound probabilities theorem
and from Eqs. (1.23, 1.24). We also note that the link between $Q$ and the variables
$X_i$ shows up only in the definition of the integration domain. If we then pass to
spherical coordinates by setting $\sqrt{\sum_i x_i^2} = r$, the integrand is angle independent
and the functional link (3.60) gives rise to an integration over the radius of the
hypersphere. By a known result of differential geometry, we know that the element
of volume integrated over the angles becomes, going from a three-dimensional sphere to
an $n$-dimensional hypersphere:

$$\int_\Omega dV = \int_\Omega r^2 \sin\theta\, dr\, d\theta\, d\phi = 4\pi r^2\, dr \;\to\; D\, r^{n-1}\, dr ,$$


where $D$ is a constant that can be obtained by integration over the angular variables.
The integral of Eq. (3.61) then becomes:

$$F_Q(q) = F_0 \int_0^{\sqrt{q}} e^{-r^2/2}\, r^{n-1}\, dr ,$$

where $F_0$ includes all the constant factors present in the calculation. If we now
operate the change of variable:

$$r = \sqrt{q} , \qquad dr = \frac{dq}{2\sqrt{q}} ,$$

and collect again all constant factors in $F_0$, we can write $F_Q(q)$ as:

$$F_Q(q) = F_0 \int_0^q e^{-q/2}\, q^{\frac{n}{2} - 1}\, dq , \tag{3.62}$$


which is the primitive of the p.d.f. we are searching for. Therefore, we have, from
Eq. (2.35):

$$p(q) = \frac{dF_Q(q)}{dq} = F_0\, e^{-q/2}\, q^{\frac{n}{2} - 1} . \tag{3.63}$$

Obviously, this function is defined only for $q \ge 0$. The constant $F_0$ is evaluated
from the normalization condition:

$$F_0 \int_0^\infty e^{-q/2}\, q^{\frac{n}{2} - 1}\, dq = 1 .$$

The integral can be calculated from the gamma function definition:

$$\Gamma(p) = \int_0^\infty x^{p-1} e^{-x}\, dx , \tag{3.64}$$

where $\Gamma(p)$, called gamma function, can be obtained via integration by parts from
Eq. (3.64):

$$\Gamma(1) = 1 ,$$
$$\Gamma(1/2) = \sqrt{\pi} ,$$
$$\Gamma(p + 1) = p\, \Gamma(p) , \tag{3.65}$$
$$\Gamma(p + 1) = p! \quad \text{for integer } p ,$$
$$\Gamma(p + 1) = p\, (p - 1)(p - 2) \cdots \tfrac{3}{2} \cdot \tfrac{1}{2}\, \sqrt{\pi} \quad \text{for half-integer } p ,$$

and in R is calculated by the routine gamma(x) with $x > 0$. From Eqs. (3.63-3.64),
we finally obtain the function:

$$p_n(q)\, dq = \frac{1}{2^{n/2}\, \Gamma(n/2)}\, e^{-q/2}\, q^{\frac{n}{2} - 1}\, dq , \quad q \ge 0 , \tag{3.66}$$


which is the p.d.f. of the square modulus of a vector with $n$ independent Gaussian
components. If the number of linearly independent components were $\nu < n$, then the sum in
Eq. (3.60) can be transformed into a linear combination of the squares of $\nu$
independent Gaussian variables and the density function is still given by (3.66),
after performing the substitution $n \to \nu$. The linearly independent variables of a $\chi^2$
distribution are called degrees of freedom.

In the international physics literature, the notation $\chi^2$ is almost always used
to indicate the distribution, the random variable and its numerical realizations (!).
Since this seems to us an excessive liberty, and to remain (partially) consistent
with our own notation, we will indicate with $Q$ a random variable having $\chi^2$
density (3.66) and with $\chi^2$ (and not with $q$) the numerical values of $Q$ obtained
experimentally. We will then write that $Q(\nu) \sim \chi^2(\nu)$ takes values $\chi^2$.

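With this notation, the density (3.66) of the variable $Q$ with $\nu$ degrees of freedom
is written as:

$$p_\nu(\chi^2)\, d\chi^2 \equiv p(\chi^2; \nu)\, d\chi^2 = \frac{1}{2^{\nu/2}\, \Gamma(\nu/2)}\, (\chi^2)^{\frac{\nu}{2} - 1}\, e^{-\chi^2/2}\, d\chi^2 , \tag{3.67}$$

which, based on Eqs. (3.64, 3.65), has mean and variance given by:

$$\langle Q \rangle = \frac{1}{2^{\nu/2}\, \Gamma(\nu/2)} \int_0^\infty x\, (x)^{\frac{\nu}{2} - 1} e^{-x/2}\, dx = \nu , \tag{3.68}$$
$$\mathrm{Var}[Q] = \frac{1}{2^{\nu/2}\, \Gamma(\nu/2)} \int_0^\infty (x - \nu)^2\, (x)^{\frac{\nu}{2} - 1} e^{-x/2}\, dx = 2\nu . \tag{3.69}$$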


Sometimes, the reduced $\chi^2$ distribution of the variable:

$$Q_R(\nu) = \frac{Q(\nu)}{\nu} \tag{3.70}$$

is used, which, from Eqs. (2.63, 2.64, 3.68, 3.69), has mean and variance given by:

$$\langle Q_R(\nu) \rangle = 1 , \qquad \mathrm{Var}[Q_R(\nu)] = \frac{2}{\nu} . \tag{3.71}$$

In Table E.3 of Appendix E, the values of the integral of the reduced $\chi^2$ density:

$$P\{Q_R(\nu) \ge \chi_R^2\} = \frac{\nu^{\nu/2}}{2^{\nu/2}\, \Gamma(\nu/2)} \int_{\chi_R^2}^{\infty} x^{\frac{\nu}{2} - 1} \exp\left( -\frac{\nu x}{2} \right) dx \tag{3.72}$$

are reported (see Fig. 3.7); they are obtained by applying Eq. (3.70) to Eq. (3.67).
This table provides the probability of exceeding an assigned value of $\chi_R^2$, a
parameter which will be extensively used later in statistics.
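The probability of exceeding a given reduced value can also be obtained in R, as an alternative to a table lookup. The sketch below compares pchisq, evaluated at the unreduced value (the reduced value multiplied by the degrees of freedom), with a direct numerical integration of the reduced density. The number of degrees of freedom and the reduced value are hypothetical.

```r
# Sketch: probability that the reduced chi-square exceeds a given value.
nu <- 5                          # hypothetical degrees of freedom
xr <- 1.8                        # hypothetical reduced chi-square value
p_tail <- pchisq(nu * xr, df = nu, lower.tail = FALSE)
# density of the reduced variable Q/nu, obtained from the chi-square density
dens_r <- function(x) {
  nu^(nu / 2) / (2^(nu / 2) * gamma(nu / 2)) * x^(nu / 2 - 1) * exp(-nu * x / 2)
}
p_int <- integrate(dens_r, lower = xr, upper = Inf)$value
print(c(pchisq = p_tail, integral = p_int))
```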
Fig. 3.7 Reduced $\chi^2$ distribution for some degrees of freedom $\nu$

We can summarize these results, found by Helmert in 1876 and generalized by
Pearson in 1900, with the following theorem.

Theorem 3.3 (of Pearson) The sum of squares of $\nu$ independent standard Gaussian
variables is a random variable with density (3.67) with $\nu$ degrees of freedom, called
$\chi^2(\nu)$.

The following theorem will also be useful:

Theorem 3.4 (Additivity of the Variable $\chi^2$) If $Q_1$ and $Q_2$ are two independent
random variables, having $\chi^2$ density with $\nu_1$ and $\nu_2$ degrees of freedom, respectively,
the variable

$$Q = Q_1 + Q_2 \tag{3.73}$$

has $\chi^2(\nu)$ density with $\nu = \nu_1 + \nu_2$ degrees of freedom: $Q \sim \chi^2(\nu_1 + \nu_2)$. Moreover,
if $Q \sim \chi^2(\nu)$ and $Q_1 \sim \chi^2(\nu_1)$, then $Q_2 \sim \chi^2(\nu - \nu_1)$.

Proof The proof of the first part of the theorem is immediate and can be seen as
a corollary of Pearson’s Theorem 3.3: since the sum of squares of $\nu_1$ independent
standard variables is distributed as $\chi^2(\nu_1)$, if other $\nu_2$ independent variables are added to
them, using Eq. (3.73), the result is a sum of squares of $\nu_1 + \nu_2$ independent standard
Gaussian variables, hence the statement. The proof of the second part of the theorem is
also easy if one uses the generating functions of Appendix C and Eq. (C.12).

It is important to remember that the theorem applies to variables $Q \sim \chi^2$, not to the
reduced ones $Q_R \sim \chi_R^2$ of Eq. (3.70).


In R, the density values of the $\chi^2$ distribution with df degrees of freedom for a
value x are calculated by the dchisq(x, df) function, whereas the cumulative,
quantile and simulated values are obtained from pchisq, qchisq and rchisq.
In the next exercises, we will realize that the $\chi^2$ density is of fundamental
importance both in statistics and in physics.
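Pearson's theorem lends itself to a quick simulation check. The sketch below sums the squares of standard Gaussian values and compares the sample mean and variance with the number of degrees of freedom and twice that number; it also compares the explicit density formula with dchisq. The number of degrees of freedom, the sample size and the seed are arbitrary.

```r
# Sketch: sum of squares of nu standard Gaussian variables.
set.seed(4)                                   # arbitrary seed
nu <- 4                                       # hypothetical degrees of freedom
q <- replicate(5000, sum(rnorm(nu)^2))        # simulated values of Q
print(c(mean = mean(q), variance = var(q)))   # compare with nu and 2 * nu
# explicit chi-square density versus dchisq
x <- c(0.5, 2, 7)
dens <- x^(nu / 2 - 1) * exp(-x / 2) / (2^(nu / 2) * gamma(nu / 2))
print(max(abs(dens - dchisq(x, df = nu))))
```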


Exercise 3.10
Find the p.d.f. of the modulus $R$ of a three-dimensional vector having
independent Gaussian components $(X, Y, Z)$ of zero mean and variance $\sigma^2$.

Answer Let us assume to have Gaussian standard components with unit
variance. The p.d.f. of the square modulus of this vector is then given by
Eq. (3.66) with $n = 3$:

$$p_3(q)\, dq = \frac{1}{2^{3/2}\, \Gamma(3/2)}\, e^{-q/2}\, q^{1/2}\, dq = \frac{1}{\sqrt{2\pi}}\, e^{-q/2}\, q^{1/2}\, dq ,$$

since $\Gamma(3/2) = \Gamma(1/2 + 1) = \sqrt{\pi}/2$, from Eqs. (3.65). This density gives
the probability that the square modulus is within $(q, q + dq)$. To have the
density of the modulus (not of its square), we must use the transformation:

$$q = x^2 + y^2 + z^2 = r^2 , \qquad r = \sqrt{x^2 + y^2 + z^2} = \sqrt{q} , \qquad 2r\, dr = dq ,$$

to obtain:

$$m(r)\, dr = \sqrt{\frac{2}{\pi}}\, r^2 e^{-r^2/2}\, dr . \tag{3.74}$$

So far we have considered the variables $(X, Y, Z)$ as standard Gaussians.
However, we know that $R$ is the modulus of a vector with non-standard
Gaussian components having a finite variance $\sigma^2$. To take into account the
non-unit variance, we operate the transformation:

$$x \to \frac{x}{\sigma} , \qquad y \to \frac{y}{\sigma} , \qquad z \to \frac{z}{\sigma} ,$$

which redefines $r$ as:

$$r \to \frac{r}{\sigma} , \qquad dr \to \frac{dr}{\sigma} .$$

By inserting this transformation in Eq. (3.74), we obtain:

$$m(r)\, dr = \sqrt{\frac{2}{\pi}}\, \frac{1}{\sigma^3}\, r^2 e^{-\frac{r^2}{2\sigma^2}}\, dr , \tag{3.75}$$

which is the well-known Maxwell density function.

Since $R^2/\sigma^2$ is a $\chi^2$ variable with 3 degrees of freedom (i.e. $R^2/\sigma^2 \sim \chi^2(3)$), from
Eqs. (2.64, 3.68, 3.69), the mean and variance are given by:

$$\left\langle \frac{R^2}{\sigma^2} \right\rangle = 3 , \qquad \mathrm{Var}\left[ \frac{R^2}{\sigma^2} \right] = 6 ,$$

so that:

$$\langle R^2 \rangle = 3\sigma^2 , \qquad \mathrm{Var}[R^2] = 6\sigma^4 . \tag{3.76}$$
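The results of Exercise 3.10 can be checked by simulation. The sketch below takes the Maxwell density in the form sqrt(2/pi) r^2 exp(-r^2 / (2 sigma^2)) / sigma^3, verifies its normalization numerically and compares the sample mean and variance of the square modulus with three times the variance and six times its square. The value of sigma, the sample size and the seed are hypothetical.

```r
# Sketch for Exercise 3.10: modulus of a vector with three Gaussian components.
sigma <- 2                                     # hypothetical standard deviation
maxwell <- function(r) sqrt(2 / pi) * r^2 / sigma^3 * exp(-r^2 / (2 * sigma^2))
norm_const <- integrate(maxwell, lower = 0, upper = Inf)$value
print(norm_const)                              # should be close to 1
set.seed(5)                                    # arbitrary seed
n <- 20000
r2 <- rnorm(n, sd = sigma)^2 + rnorm(n, sd = sigma)^2 + rnorm(n, sd = sigma)^2
print(c(mean = mean(r2), variance = var(r2)))  # compare with 3 sigma^2 and 6 sigma^4
```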




Exercise 3.11
Find the energy density of the molecules of an ideal gas at the absolute
temperature $T$.

Answer In an ideal gas, the velocity components $(V_x, V_y, V_z)$ of the
molecules are random variables that satisfy the conditions of the central
limit Theorem 3.1. Therefore, the velocity modulus follows the Maxwellian
distribution (3.75). If we consider the relation between kinetic energy and
velocity of a molecule of mass $m$:

$$E = \frac{1}{2} m v^2 , \qquad v = \sqrt{\frac{2E}{m}} , \qquad m v\, dv = dE ,$$

we can write:

$$m(E)\, dE = \frac{2}{\sqrt{\pi}}\, \frac{1}{\sigma^3\, m \sqrt{m}}\, \sqrt{E}\, e^{-\frac{E}{m \sigma^2}}\, dE .$$

If we now use the known thermodynamics equation that links variance and
temperature:

$$m \sigma^2 = KT , \qquad \sigma = \sqrt{\frac{KT}{m}} , \tag{3.77}$$

where $K$ is the Boltzmann constant, we finally obtain:

$$m(E)\, dE = \frac{2}{\sqrt{\pi}}\, \frac{1}{(KT)^{3/2}}\, \sqrt{E}\, e^{-\frac{E}{KT}}\, dE , \tag{3.78}$$

which is the famous Boltzmann distribution.

From Eqs. (3.76, 3.77), another fundamental result of thermodynamics is
obtained, that is, the link between the mean square velocity $\langle v^2 \rangle$ of the
molecules (and hence their mean kinetic energy) and the absolute temperature $T$:

$$\langle E \rangle = \frac{1}{2} m \langle v^2 \rangle = \frac{3}{2} KT .$$

At this point we would like to note that we have obtained, both in this
exercise and in the previous one, some fundamental results of statistical
physics only using, as physics hypotheses, the central limit theorem and
Eq. (3.77).
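A numerical check of the energy density is sketched below, under the assumption that Eq. (3.78) has the standard form (2/sqrt(pi)) (KT)^(-3/2) sqrt(E) exp(-E/KT). In that form it coincides with a gamma density of shape 3/2 and scale KT, so it can be compared with dgamma, and its mean can be compared with (3/2) KT. The value of KT is a hypothetical number in arbitrary energy units.

```r
# Sketch for Exercise 3.11: energy density of an ideal gas (assumed standard form).
KT <- 1.5                        # hypothetical value of K*T, arbitrary units
boltz <- function(E) 2 / sqrt(pi) * KT^(-3 / 2) * sqrt(E) * exp(-E / KT)
E <- c(0.1, 1, 3, 10)
print(max(abs(boltz(E) - dgamma(E, shape = 3 / 2, scale = KT))))   # same density
mean_E <- integrate(function(E) E * boltz(E), lower = 0, upper = Inf)$value
print(c(mean_E = mean_E, expected = 1.5 * KT))
```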


3.9 Uniform Density 


A continuous random variable $X$, assuming values in the finite interval $[a, b]$, is
defined as uniform in $[a, b]$ and is indicated as $X \sim U(a, b)$ when it has a constant
p.d.f., which is called uniform or flat density (see Fig. 3.8), given by:

$$u(x) = \begin{cases} \dfrac{1}{b - a} & \text{for } a \le x \le b \\[1ex] 0 & \text{for } x < a ,\ x > b \end{cases} \tag{3.79}$$

The normalization condition:

$$P\{a \le X \le b\} = \int_a^b u(x)\, dx = 1$$

is satisfied, and the mean and variance are given by:

$$\mu = \frac{1}{b - a} \int_a^b x\, dx = \frac{b + a}{2} , \tag{3.80}$$
$$\sigma^2 = \frac{1}{b - a} \int_a^b \left( x - \frac{b + a}{2} \right)^2 dx = \frac{(b - a)^2}{12} , \tag{3.81}$$

where the last integral is easily calculated, via the substitution $y = x - (b + a)/2$,
$dy = dx$, between the limits $-(b - a)/2$ and $(b - a)/2$.
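The mean and variance of the uniform density can be verified by numerical integration. In the sketch below the interval limits are hypothetical, and the function u is a direct transcription of the flat density.

```r
# Sketch: mean and variance of the uniform density in [a, b].
a <- 2                                            # hypothetical lower limit
b <- 5                                            # hypothetical upper limit
u <- function(x) rep(1 / (b - a), length(x))      # flat density inside [a, b]
mu <- integrate(function(x) x * u(x), lower = a, upper = b)$value
s2 <- integrate(function(x) (x - mu)^2 * u(x), lower = a, upper = b)$value
print(c(mean = mu, expected = (b + a) / 2))
print(c(variance = s2, expected = (b - a)^2 / 12))
```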


Fig. 3.8 Uniform density $u(x)$ and corresponding cumulative function $F(x)$ for a random variable
assuming values in $[a, b]$


The density is often considered uniform within the interval $a = 0$ and $b = \Delta$; in
this case the mean and variance are given by:

$$\mu = \frac{\Delta}{2} , \qquad \sigma^2 = \frac{\Delta^2}{12} . \tag{3.82}$$

The support of the density is given by $\mu - \Delta/2 \le x \le \mu + \Delta/2$ and its standard
deviation is $\sigma = \Delta/\sqrt{12}$. For a random variable $a \le X \le b$, having uniform
density, the localization probability in $(x_1, x_2)$ is proportional to the width of the
interval:

$$P\{x_1 \le X \le x_2\} = \frac{1}{b - a} \int_{x_1}^{x_2} dx = \frac{x_2 - x_1}{b - a} . \tag{3.83}$$


Conversely, if a continuous random variable satisfies Eq. (3.83), then it follows the 
uniform density. 

We now present a very simple, but extremely important and general theorem 
related to this distribution. 


Theorem 3.5 (Cumulative Random Variables) If X is a random variable with
continuous density p(x), the cumulative random variable C:

$$C(X) = \int_{-\infty}^{X} p(x)\, dx \qquad (3.84)$$

is uniform in [0, 1], that is, C ~ U(0, 1).


Proof The probability for X to be within $[x_1, x_2]$ coincides with the probability for
the cumulative random variable C to be in the range $[c_1 = C(x_1), c_2 = C(x_2)]$.
From Eq. (2.33), we then have:

$$P\{c_1 \le C \le c_2\} = P\{x_1 \le X \le x_2\} = \int_{x_1}^{x_2} p(x)\, dx = \int_{-\infty}^{x_2} p(x)\, dx - \int_{-\infty}^{x_1} p(x)\, dx = c_2 - c_1, \qquad (3.85)$$

which implies that the cumulative variable C ~ U(0, 1), since it satisfies Eq. (3.83)
with (b - a) = 1.


You should fully realize the conceptual and practical importance of the theorem: the
cumulative variable is always uniform, whatever the original distribution is.

If the integral (3.84) is known analytically, then the values of the cumulative
variable C can be written as c = F(x). If this function is also invertible, then the
variable:

$$X = F^{-1}(C) \qquad (3.86)$$




has density p(x). If you have a uniform variable generator in [0, 1], a roulette or a 
computer random number generator, continuous variables with any density can be 
generated using the equation: 


$$X = F^{-1}(\text{random}) . \qquad (3.87)$$
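
A minimal sketch of Eq. (3.87) in R may help. As an illustrative assumption (this density is our choice, not one discussed in the text), we take p(x) = 3x^2 in [0, 1], for which F(x) = x^3 and therefore F^{-1}(c) = c^(1/3).

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n <- 10000
random <- runif(n)               # uniform values in [0, 1]
x <- random^(1/3)                # X = F^{-1}(random), with F(x) = x^3

hist(x, freq = FALSE)                        # sampled values
curve(3 * x^2, from = 0, to = 1, add = TRUE) # target density p(x) = 3 x^2
mean(x)                          # to be compared with the exact mean 3/4
```

The histogram should follow the superimposed curve, and the sample mean should be close to 3/4 within statistical fluctuations.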


The extension of Theorem 3.5 to discrete variables requires a minimum of attention
and the use of Eq. (2.28). If C is a uniform variable, keeping in mind that F(x)
takes values in [0, 1], from (3.83), we obtain:

$$P\{X = x_k\} = P\{x_{k-1} < X \le x_k\} = F(x_k) - F(x_{k-1}) = P\{F(x_{k-1}) < C \le F(x_k)\}. \qquad (3.88)$$

The discrete equivalent of Eq. (3.87), if one has a random generator, then becomes:

$$\{X = x_k\} \quad \text{if} \quad \{F(x_{k-1}) < \text{random} \le F(x_k)\} \qquad (\text{if } k = 1,\ F(x_0) = 0). \qquad (3.89)$$
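
The rule (3.89) can be sketched in R with the routines cumsum and findInterval. The values xk and the probabilities pk below are hypothetical and serve only as an illustration.

```r
set.seed(2)                      # arbitrary seed
xk <- c(1, 2, 5)                 # hypothetical values of the discrete variable
pk <- c(0.5, 0.3, 0.2)           # hypothetical probabilities (they sum to 1)
Fk <- cumsum(pk)                 # F(x_1), F(x_2), F(x_3); F(x_0) = 0 is implied

random <- runif(10000)
# Index k such that F(x_{k-1}) < random <= F(x_k); the last value F = 1 is
# dropped from the break points, so that k always lies between 1 and 3
k <- findInterval(random, Fk[-length(Fk)], left.open = TRUE) + 1
x <- xk[k]

table(x) / length(x)             # observed frequencies, to be compared with pk
```

The observed frequencies should reproduce pk within statistical fluctuations.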


All Monte Carlo simulation methods that will be examined in detail in Chap. 8
are based on Eqs. (3.87, 3.89). Equation (3.86) has a convincing graphical
interpretation, reported in Fig. 3.9. Consider the probabilities $P\{x_1 \le X \le x_2\}$ and
$P\{x_3 \le X \le x_4\}$ defined within the values $[x_1, x_2]$ and $[x_3, x_4]$: they are represented
by the areas subtended by the density function p(x) and displayed by the shaded
zones of Fig. 3.9. The area between $[x_1, x_2]$ (corresponding to less probable X
values) is smaller than the area between $[x_3, x_4]$ (more probable X values). By
construction, these two areas are identical to the lengths of the intervals $[c_1, c_2]$ and
$[c_3, c_4]$, obtained from the cumulative function F(x). Let's now consider a variable
C ~ U(0, 1) on the cumulative ordinate axis: it will fall more frequently in $[c_3, c_4]$
than in $[c_1, c_2]$, with probabilities exactly coinciding with the widths of these
intervals. Therefore, if, given a value $c_0$ assumed by C, we find (graphically or
analytically) the corresponding value $x_0$ and repeat this procedure several times, we
will obtain a sample of X values from p(x).

We can easily verify the theorem by generating cumulative variables with R. For
example, we can consider the χ² density and write:


y <- pchisq(rchisq(1000,df=10),df=10)
hist(y)


to check that the histogram of y does follow the uniform distribution. It is also
interesting to see that if one writes:


y <- pchisq(rchisq(1000,df=9),df=10)
hist(y)


a non-uniform distribution is obtained for y, because 1000 variables have been
generated from a χ² distribution with 9 degrees of freedom (the value of the upper
limit X in Eq. (3.84)), whereas the cumulative is calculated from a different p.d.f.
p(x) having df=10. Indeed, the theorem is valid if and only if in Eq. (3.84) X is
sampled from p(x). In the following, we will often use this property to check the
parent distribution of some variables.


Fig. 3.9 Graphical representation of the cumulative variable theorem. The functions p(x) and
F(x) are a generic p.d.f. and the corresponding cumulative, respectively
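
The visual check with hist can be complemented by a numerical one. The following sketch uses the Kolmogorov-Smirnov routine ks.test of R to compare the cumulative variables with U(0, 1); this test is our addition for illustration and is not introduced in this section.

```r
set.seed(3)                                       # arbitrary seed
good <- pchisq(rchisq(1000, df = 10), df = 10)    # same p.d.f. for sampling and cumulative
bad  <- pchisq(rchisq(1000, df = 9),  df = 10)    # mismatched degrees of freedom

# Kolmogorov-Smirnov comparison of each vector with the uniform cumulative
ks_good <- ks.test(good, "punif")
ks_bad  <- ks.test(bad,  "punif")

c(ks_good$statistic, ks_bad$statistic)   # maximum distance from the uniform cumulative
c(ks_good$p.value,  ks_bad$p.value)      # corresponding p-values
```

For the mismatched case, one expects a clearly larger distance from the uniform cumulative and a very small p-value.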


Exercise 3.12 
Given a computer-generated random value uniformly distributed
in [0, 1], randomly sample arrival times of stochastically independent events.


Answer The arrival time distribution of stochastically independent events is a
negative exponential, with the cumulative function of an arrival in [0, t] given by
Eq. (3.51). Given Theorem 3.5, this function is a uniform random variable.
We then have:

$$\text{random} = 1 - e^{-\lambda t} .$$

By inverting this equation, we get:

$$t = -\frac{1}{\lambda} \ln(1 - \text{random}) . \qquad (3.90)$$


Since, if random ~ U(0, 1), the same also holds for (1 - random), this
equation is equivalent to:

$$t = -\frac{1}{\lambda} \ln(\text{random}) . \qquad (3.91)$$

Furthermore, it is easy to see that a vector of 1000 values

y <- rgamma(1000,shape=1)

obtained with the R routine for the generation of exponential variates
with λ = 1 has a distribution identical to that of a vector z <-
-log(runif(1000)), according to Eq. (3.91). Equations (3.90, 3.91) are
an example of the general Eqs. (3.86, 3.87).
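
For a generic rate, Eq. (3.91) can be sketched as follows. The value lambda = 2 is a hypothetical choice, and the comparison with the built-in generator rexp is only meant as a consistency check.

```r
set.seed(4)                            # arbitrary seed
lambda <- 2                            # hypothetical rate (events per unit time)
random <- runif(100000)
t_inv  <- -log(random) / lambda        # arrival times from Eq. (3.91)
t_r    <- rexp(100000, rate = lambda)  # built-in exponential generator

c(mean(t_inv), mean(t_r), 1 / lambda)      # sample means versus the exact mean
quantile(t_inv, c(0.25, 0.5, 0.75))        # sample quartiles
qexp(c(0.25, 0.5, 0.75), rate = lambda)    # exact quartiles
```

Both samples should give means close to 1/lambda and quartiles close to the exact ones.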


3.10 Chebyshev’s Inequality 


The standard deviation, which is an index of the dispersion of a variable around its
mean value, satisfies an important and general property. Let us consider a p.d.f. of
mean μ, finite variance σ², and the interval [μ - Kσ, μ + Kσ], where K is a positive
real number. Obviously, the points outside this range are defined by the condition
|x - μ| > Kσ. Considering the expression (2.57) of the variance for continuous
variables, we can write:

$$\begin{aligned}
\sigma^2 &= \int_{-\infty}^{+\infty} (x - \mu)^2 p(x)\, dx \\
&= \int_{\mu - K\sigma}^{\mu + K\sigma} (x - \mu)^2 p(x)\, dx + \int_{|x - \mu| > K\sigma} (x - \mu)^2 p(x)\, dx \\
&\ge \int_{|x - \mu| > K\sigma} (x - \mu)^2 p(x)\, dx \ \ge\ K^2\sigma^2 \int_{|x - \mu| > K\sigma} p(x)\, dx \\
&= K^2\sigma^2 \left( 1 - \int_{\mu - K\sigma}^{\mu + K\sigma} p(x)\, dx \right) .
\end{aligned}$$

From the last relation, we obtain:

$$\int_{\mu - K\sigma}^{\mu + K\sigma} p(x)\, dx \ \ge\ 1 - \frac{1}{K^2} , \qquad (3.92)$$




which is known as Chebyshev's inequality. This relation can be easily proved even
for discrete variables and is quite general, because the only condition that has been
imposed on p(x) is to have a finite variance. Let us now see what information is
present in this general law. Similar to Eq. (3.35), this inequality assumes the form:

$$P\{|X - \mu| \le K\sigma\} = \int_{\mu - K\sigma}^{\mu + K\sigma} p(x)\, dx \ \ge\ 1 - \frac{1}{K^2} = \begin{cases} 0 & \text{for } K = 1 \\ 0.75 & \text{for } K = 2 \\ 0.89 & \text{for } K = 3, \end{cases} \qquad (3.93)$$

which shows that the intervals μ ± 2σ and μ ± 3σ around the mean cover at least 75%
and 89% of the total occurrence probability, respectively.

Equation (3.93) sometimes justifies the approximated 3σ law, which consists in
considering probabilities outside [μ - 3σ, μ + 3σ] to be negligible for any statistical
distribution. Generally this is a good approximation, because Chebyshev's inequality,
which predicts no more than about 11% probability outside of ±3σ, is almost always
a significant overestimate of the actual values. For example, the Gauss density has
only 0.3% of the values "outside 3σ", while, in the case of uniform density, all
values are included within ±2σ (check as exercise).

Considering only values within 3σ is very common, and, generally, this leads,
as previously mentioned, to acceptable results. However, in special cases, if one is
dealing with "very broad" densities, it is good to remember that with this method a
significant error, up to about 11%, can occur, as shown by Eq. (3.93).
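
The comparison between Chebyshev's bound and the actual probabilities can be sketched in R. For the uniform density of width Δ we use σ = Δ/√12, so that the interval μ ± Kσ covers the fraction min(1, K/√3) of the support; the variable names are our choice.

```r
K <- 1:3
chebyshev <- 1 - 1 / K^2               # lower bound of Eq. (3.93)
gauss     <- pnorm(K) - pnorm(-K)      # Gaussian probability within K sigma
unif      <- pmin(1, K / sqrt(3))      # uniform density: fraction within K sigma

round(data.frame(K, chebyshev, gauss, unif), 4)
```

In every row the Gaussian and uniform probabilities should lie above the Chebyshev bound, which confirms that the bound is a conservative estimate.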


3.11 How to Use Probability Calculus 


Figure 3.10 and Table 3.1 report the fundamental probability distributions that have 
been so far obtained. The starting point is the binomial distribution of Eq. (2.29), 
for discrete and independent random events generated with constant probability. 

The limit for np, n(1 - p) ≫ 1 (in practice np, n(1 - p) > 10), where n is the
number of attempts, leads to the Gaussian density, whereas the limit n ≫ 1, p ≪ 1
(in practice n > 10, p < 0.1) leads to the Poissonian density. This latter density,
when np > 10 (and therefore also n(1 - p) > 10, since p ≪ 1), also evolves
towards the Gaussian density while always maintaining the relation (3.16) μ = σ²,
typical of Poissonian processes.
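
These limits can be checked numerically with the density routines of R. The values of n and p below are hypothetical and were chosen only to satisfy the conditions quoted above.

```r
# Binomial versus Gaussian: n = 1000, p = 0.5 (np and n(1-p) well above 10)
n <- 1000
p <- 0.5
x <- 0:n
d_gauss <- max(abs(dbinom(x, n, p) - dnorm(x, n * p, sqrt(n * p * (1 - p)))))

# Binomial versus Poissonian: n = 1000, p = 0.004 (n >> 1, p << 1)
p2 <- 0.004
d_pois <- max(abs(dbinom(x, n, p2) - dpois(x, n * p2)))

c(d_gauss, d_pois)   # largest pointwise differences between the probabilities
```

Both differences should be small compared with the largest probabilities of the two binomial distributions.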

We have also seen that the Gaussian and Poissonian densities, far from being 
only limiting cases of the binomial density, are the reference distributions of many 
important natural phenomena. The Gaussian distribution is the limiting density of 
the linear superposition of independent random variables, none of which prevails 
over the others. This result comes from the central limit Theorem 3.1. The statistical 
distribution of the square modulus of a vector of Gaussian components is the χ²
density, which in three dimensions is called Maxwell's density or Maxwellian.


Fig. 3.10 The fundamental distributions of probability theory and the connections among them


The Poisson distribution is instead the universal distribution of the number 
of independent events generated discretely with constant probability over time 
(stochastic generation). The arrival times between these events also follow universal 
distributions, the negative exponential and the gamma or Erlangian ones. 

Finally, the uniform density is also of general importance, because, as shown in 
Theorem 3.5, it is the distribution of all the cumulative random variables. As we will 
see, this principle is the basis of Monte Carlo simulations. 

It should already be clear to the reader, and it will become increasingly so in
the following, that the statistical distributions deduced in probability theory provide
a specific and coherent scheme for the interpretation of a remarkable collection of
natural phenomena. We will now try to better explain this point, with a series of
examples and some insights.


Table 3.1 The fundamental p.d.f. of probability theory

Name          Comment
Binomial      Number of successes in independent trials with constant probability
Gaussian      Linear combination of independent variables
Chi-square    Modulus of a Gaussian vector
Poissonian    Counts
Exponential   Arrival times between Poissonian events
Gamma         Sum of k negative exponential variables
Uniform       Cumulative variables

Intuition alone is not enough to interpret random phenomena: if we toss a coin 
1000 times, we guess that on average we will have 500 heads (or tails), but if we get 
450 heads and 550 tails how do we know if the result is compatible with chance or 
if instead the coin is rigged or the flips were not regular? Using probability calculus, 
we can approach these problems in a quantitative way, using continuous or discrete 
p.d.f. and solving sums (for discrete variables) or integrals (for continuous variables) 
of the type: 


$$P\{a \le X \le b\} = \sum_{x=a}^{b} p(x) \ \to\ \int_a^b p(x)\, dx , \qquad (3.94)$$


which gives the probability level of the result, that is, the probability to obtain the 
observed values within the interval [a, b], if the model given by p(x) is valid. 
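
Both forms of Eq. (3.94) can be sketched in R. The discrete example uses the binomial model of the coin (n = 1000, p = 0.5) with the interval [450, 550]; the continuous example uses a standard Gaussian with the interval [-1, 2], which is an arbitrary choice of ours.

```r
# Discrete case: explicit sum of the binomial probabilities
p_sum <- sum(dbinom(450:550, size = 1000, prob = 0.5))
# Same quantity from the cumulative function
p_cum <- pbinom(550, 1000, 0.5) - pbinom(449, 1000, 0.5)

# Continuous case: numerical integral of the standard Gaussian density
p_int <- integrate(dnorm, -1, 2)$value
# Same quantity from the cumulative function
p_cdf <- pnorm(2) - pnorm(-1)

c(p_sum, p_cum, p_int, p_cdf)
```

In practice the cumulative functions are the most convenient way to obtain these probability levels.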

It is therefore possible to judge whether a certain deviation from the expected
value is due to chance or not. As we have already mentioned in Sect. 2.6, in the
first case it is said that there is a normal statistical fluctuation and that the model
represented by p(x) is valid (or better, that it is not falsified by observation); in the
second case, the result is interpreted as a significant deviation and the model is
rejected. In this type of study, the quantile values (2.18) or the mean and standard
deviation of the model distribution can be used. In this way, the analysis is faster
and allows us to acquire useful mental automatisms, easy to remember by heart,
which also make it possible to evaluate the experimental results quickly and correctly. If
the quantile values of the distribution are known, by setting $a = x_\alpha$ and $b = x_\beta$, the
probability interval, as also shown in Fig. 3.11, becomes:

$$P\{x_\alpha \le X \le x_\beta\} = \beta - \alpha . \qquad (3.95)$$


Fig. 3.11 Graphical representation of the probability level β - α


If X has a symmetric p.d.f., the following notation:

$$P\{-t_{1-\alpha/2} \le X \le t_{1-\alpha/2}\} = 1 - \alpha \qquad (3.96)$$

is often used. In many cases, $t_\alpha$ is the standard normal quantile (easily obtained
from Table E.1), or is evaluated from Student's density (which we will discuss in the
following). In several textbooks, Gaussian quantiles are indicated as $z_\alpha$.

In R, it is easy to calculate Eq. (3.95) using cumulative functions (see Table B.2).
For example, if you want to evaluate the area between the values $x_\alpha = -1.5$
and $x_\beta = 2$ of the standard Gaussian, just write pnorm(2)-pnorm(-1.5),
obtaining the value 0.91044. Levels commonly used in statistics are 1 - α =
0.90, 0.95, 0.99, 0.999, which correspond to the Gaussian quantile values $t_{1-\alpha/2}$ =
1.64, 1.96, 2.58, 3.29. Alternatively, as we will often do in this text, one can adopt
the convention, common among physicists, that parameterizes probability intervals
according to the 3σ law (3.35). The standard deviation can be considered as the
universal unit of measurement of statistical fluctuations. To calculate its value, one
should not proceed intuitively but use probability calculus. So, does a result of 450
hits in 1000 flips only represent a statistical fluctuation or a significant deviation
from the expected value of 500? The answer is in the example below.
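
Before turning to the exercise, we note that the Gaussian quantiles quoted above can be reproduced with the R routine qnorm. The sketch below is only a cross-check of those values; the variable names are our choice.

```r
level <- c(0.90, 0.95, 0.99, 0.999)    # probability levels 1 - alpha
alpha <- 1 - level
t_q   <- qnorm(1 - alpha / 2)          # standard normal quantiles
round(t_q, 2)

# Each symmetric interval gives back the probability level it was built from
pnorm(t_q) - pnorm(-t_q)
```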


Exercise 3.13 
A coin was flipped 1000 times and 450 heads were obtained. Is this result 
compatible with the hypothesis of random flipping of a non-rigged coin? 


Answer The model taken as a reference, which in statistics will be called the null
hypothesis, predicts that the a priori probability of obtaining heads in a single
flip is p = 0.50 and that the probabilities of the possible outcomes of a thousand
flips, ranging from 0 to 1000, can be calculated from the binomial distribution
with mean and standard deviation given by Eq. (3.6):

$$\mu = np = 500 , \qquad \sigma = \sqrt{np(1 - p)} = \sqrt{500 \cdot 0.5} = \sqrt{250} = 15.8 .$$

The observed frequency, that is, the experimental result, is f = 0.45, which
corresponds to a standard value of:

$$t = \frac{450 - 500}{15.8} = -3.16 .$$

The result differs by 3.16 standard deviations from the expected value. Since
np = n(1 - p) = 1000 · 0.5 = 500 > 10, we can assume the Gaussian
approximation for the standard variable, and use Table E.1 of Appendix E.
From this table we read, in correspondence of t = 3.16, the value 0.4992.
The area of the tail to the left of t is given by:

$$P\{T < t = -3.16\} = 0.5000 - 0.4992 = 8 \cdot 10^{-4} ,$$


which is the probability to obtain by chance, if the model holds, values < 450.
With R, we obtain, using the command pnorm(-3.16), a value of
0.000788.

Now pay attention to this crucial step: if we reject the model, the probability
of being wrong, that is, of rejecting it when it is true, is not greater than
8 in 10,000. It is also said that the data agree with the model with a
significance level of $8 \cdot 10^{-4}$.

In conclusion, since the significance level is very small, we can say, with
a small chance of being wrong, that 450 successes in a thousand tosses
represent an event in disagreement with the binomial model with p = 1/2,
which assumes independent flips of a non-rigged coin.
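
As a cross-check of the Gaussian approximation, the same tail can be computed from the exact binomial cumulative function pbinom. The sketch below is our addition; the continuity correction (the shift by 0.5) is a standard refinement that is not used in the exercise.

```r
n <- 1000
p <- 0.5
x_obs <- 450
mu    <- n * p
sigma <- sqrt(n * p * (1 - p))

p_gauss <- pnorm((x_obs - mu) / sigma)         # Gaussian approximation, as in the exercise
p_corr  <- pnorm((x_obs + 0.5 - mu) / sigma)   # Gaussian with continuity correction
p_exact <- pbinom(x_obs, n, p)                 # exact binomial tail P{X <= 450}

c(p_gauss, p_corr, p_exact)
```

All three values are below one per thousand, so the conclusion of the exercise does not depend on the approximation used.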




The hypothesis is generally considered to be falsified when the observed
significance level is below 1-5%. However, this value depends on the type of
problem being considered, and it is (at least partially) subjective, as we will discuss
in more detail in Sect. 7.1. For instance, let's assume that instead of the fairness of a
coin, we are considering the safety of an airplane. If a certain experiment results in a
significance level of one per thousand with respect to the null hypothesis of a design
flaw, we probably wouldn't feel like rejecting the hypothesis and concluding that the
aircraft is well designed. In fact, the test indicates that, in this case, one in a thousand
aircraft could crash. Probability theory thus allows us to quantify the possibilities
that are taken into consideration in the study of a problem, but often, in the final
decision, it is necessary to make a cost/benefit analysis that also takes into account
factors that are not strictly mathematical or statistical by nature. However, there are
cases in which the decision is easy, because the significance level reached is so small
that the conclusion practically coincides with certainty. For example, if 420 heads are
obtained in a thousand flips of a coin, the standard variable would be 5.1 and the
probability of being wrong by rejecting the true hypothesis of a "fair coin" would be
basically zero.

In the real experiment as shown in Table 2.2, and by Problem 2.10, 479 heads 
per thousand tosses have been obtained. This result corresponds to a value of the 
standard variable t = |479 — 500|/15.8 = 1.39, to which Table E.1 assigns a 
value of 0.4177, corresponding to a significance level of 8.2%. Here we have a first 
defined point in the analysis of the experiment of Table 2.2 (that we will reconsider 
many times in the following): the global number of heads obtained is in reasonable 
agreement with the binomial model and a priori probability of 1/2. 

In the previous exercise, the so-called one-tailed test was used. In the following 
example, the two-tailed test will be used, in which values to the left and to the right 
of a fixed interval are discarded. 


Exercise 3.14 

What is the probability to be wrong by adopting as a decision rule to define 
a fair coin (i.e. with 1/2 probability) a number of hits in a thousand flips 
between 450 and 550? 


Answer Using the data from the previous exercise, we immediately obtain the
result:

$$P\{|T| > t = 3.16\} = 2\, (0.5000 - 0.4992) = 1.6 \cdot 10^{-3} .$$

This means that about 2 fair coins are discarded out of 1000, assuming that all
tested coins are fair.


Referring to this last exercise, we could wonder how many bad coins are accepted. In
other words, what is the probability to accept a false hypothesis as true? Generally,
finding this probability is not easy, because you should know the a priori probability
of all the variables related to the problem. In the previous example, the true
probabilities of all coins used in the test should be known, as explained in the
following problem.


Exercise 3.15 

In the coin stock of the previous exercise, there is one coin with a priori
probability p = 0.6 for the face of interest. What is the probability of
accepting it as fair while still adopting 450 ≤ x ≤ 550 as the decision rule?


Answer We have to calculate P{450 ≤ X ≤ 550} for a Gaussian with mean
and standard deviation given by:

$$\mu = 0.6 \cdot 1000 = 600 , \qquad \sigma = \sqrt{1000 \cdot 0.6 \cdot (1 - 0.6)} = 15.5 .$$

By considering the two standard values:

$$t_{450} = \frac{450 - 600}{15.5} = -9.68 , \qquad t_{550} = \frac{550 - 600}{15.5} = -3.22 ,$$

from routine pnorm, we obtain the probability:

$$P\{450 \le X \le 550\} = \text{pnorm}(-3.22) - \text{pnorm}(-9.68) = 0.00064 ,$$

which corresponds to a probability of wrongly accepting the coin of about 6 in
10,000.
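
The calculation of this exercise can be repeated for other biased coins. The sketch below wraps it in a hypothetical helper function accept (our name) and evaluates it for a few illustrative values of the true probability, which are our choice.

```r
# Probability that a coin with true probability p gives 450 <= x <= 550
# in n = 1000 flips, in the Gaussian approximation used in the exercise
accept <- function(p, n = 1000, lo = 450, hi = 550) {
  mu    <- n * p
  sigma <- sqrt(n * p * (1 - p))
  pnorm((hi - mu) / sigma) - pnorm((lo - mu) / sigma)
}

p_true <- c(0.50, 0.52, 0.55, 0.58, 0.60)   # hypothetical true probabilities
data.frame(p_true, acceptance = signif(accept(p_true), 3))
```

The acceptance probability is close to 1 for the fair coin, about one half for p = 0.55 (where the mean coincides with the upper limit of the interval), and drops below one per thousand for p = 0.6.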


In the examples discussed so far, the Gaussian approximation of the binomial 
distribution has always been used. Here is a different situation. 


Exercise 3.16 

In a population of 10,000 inhabitants, historical data on a rare disease give 
four cases per year. If there are ten cases in a year, is there an increase in the 
disease or is it simply a statistical fluctuation? 



Answer The most reasonable working hypothesis is to assume a Poisson
distribution with μ = 4 as the null hypothesis. In this case, we cannot use the
Gaussian approximation, which requires μ > 10, and we are forced to use
Eq. (3.94). The probability of observing, by pure chance, at least ten cases of
illness, when the annual average is four, is then:

$$\begin{aligned}
P\{X \ge 10\} &= \sum_{x=10}^{\infty} \frac{\mu^x}{x!}\, e^{-\mu} = 1 - \sum_{x=0}^{9} \frac{4^x}{x!}\, e^{-4} \\
&= 1 - (0.0183 + 0.0733 + 0.1465 + 0.1954 + 0.1954 \\
&\qquad\ + 0.1563 + 0.1042 + 0.0595 + 0.0298 + 0.0132) \\
&= 1 - 0.9919 = 8.1 \cdot 10^{-3} .
\end{aligned}$$

Alternatively, one can use the R command: x=1-ppois(9,lambda=4)
which gives a result of x=0.0081.

This value represents the observed significance level of the hypothesis, that is, 
the probability to discard a true hypothesis. We can therefore reasonably say 
that we are in the presence of an increase of the disease, with a probability of 
being wrong, if the null hypothesis were correct, of about 8 per thousand. 


Finally, let us consider a very instructive example, which can be considered as the 
paradigm of the way science operates as regards the possible rejection (falsification) 
of a theory. 


Exercise 3.17 

A committee of astrologers interviews a person, without knowing his or her personal
details, to try to identify the person's zodiac sign. At the end of the examination, three
zodiac signs, one correct and the other two incorrect, are submitted to the
committee. If the committee got at least 50 successes in 100 tests, could
astrology be reasonably considered a science?


Answer Let us assume pure chance as the null hypothesis. In this case, the 
probability for the committee to give the correct answer just by guessing it is 
equal to 1/3, for each examined person. The terms of the problem are then: 


• p.d.f.: binomial
• Number of attempts: n = 100
• Success probability at each attempt: p = 1/3
• Number of successes: x = 50
• Expected mean of the distribution: μ = np = 33.3
• Standard deviation: $\sigma = \sqrt{np(1 - p)} = 4.7$


Since the conditions np > 10, n(1 - p) > 10 are valid, we can use Gaussian
probabilities to evaluate the probability level of the standard value:

$$t = \frac{50 - 33.3}{4.7} = 3.55 .$$

The probability to get at least 50 hits is computable from Table E.1 of
Appendix E:

$$P\{X \ge 50\} = P\{T \ge 3.55\} = 0.5000 - 0.4998 \simeq 2 \cdot 10^{-4} ,$$

or with the R command: 1-pnorm(3.55)=0.0001926. We can therefore
reject the hypothesis of randomness, with a probability to be wrong of around
2 in 10,000.

If a series of results of this kind were achieved, astrology would acquire
scientific dignity and could well be taught in schools. However, reality is
quite different: in 1985, the journal Nature [Car85] reported the results of
this test, conducted by a mixed committee of scientists and astrologers. The
success rate, in 120 trials, was 34%. To date, no significant deviations from
the laws of chance have been published, either for astrology or for many other
pseudo-sciences such as telepathy and clairvoyance, in any journal accredited
by the international scientific community. In other words, we can state that,
in this kind of experiment, the hypothesis of pure chance has never been
falsified, that is, that astrology, telepathy and clairvoyance have no scientific
validity.
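
For the hypothetical outcome of 50 successes in 100 tests, the Gaussian estimate can be compared with the exact binomial tail given by pbinom. This comparison is our addition and is only a sketch.

```r
n <- 100
p <- 1/3
p_gauss <- 1 - pnorm((50 - n * p) / sqrt(n * p * (1 - p)))  # Gaussian approximation
p_exact <- 1 - pbinom(49, n, p)                             # exact P{X >= 50}
c(p_gauss, p_exact)
```

The two values need not coincide, since the Gaussian density is only an approximation in the tail of the binomial distribution, but both are well below one per thousand and lead to the same conclusion.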


In these last exercises, we have discussed whether or not to accept an experimental 
value, having previously adopted a certain probabilistic model assumed as true. 
These topics, which are the subject of statistics, will be discussed in detail later, 
starting from Chap. 6. However, we think it was helpful to have a preliminary look 
at these interesting problems. 


3.12 Problems 


3.1 Solve Problem 1.9 (i.e. to find probability P{X < Y} for two uniform random 
variables 0 < X, Y < 1) without using geometrical arguments. 


3.2 Random walk: a particle moves in steps and, at each step Δx, it may remain at
rest (Δx = 0) or deviate by Δx = ±1 with the same probability. Calculate, after
500 steps, the mean value and standard deviation of the path X.


3.3 Unlike the problem above, this time the particle chooses at each step, with equal
probability p = 1/2, the deviations Δx = -1 and Δx = +1.


3.4 The probability to transmit a wrong bit is 10~. Calculate the probability that 
(a) a wrong bit is present in a 16-bit number and (b) the mean number of wrong bits. 


3.5 The probability that a person has the flu virus in a certain season is 20%. Find 
the probability that, in a room with 200 people, the carriers of the flu are between 
30 and 50. 


3.6 Find, for n → ∞, the density function of the variable $Y = X_1 X_2 \ldots X_n$, where
$X_i$ are positive random variables satisfying the central limit theorem conditions.


3.7 Sometimes, to describe the width of a density function, the "full width at half
maximum" is used, which is defined as FWHM = $|x_2 - x_1|$, where $p(x_1) = p(x_2) =
p_{\max}/2$. Find the relation between FWHM and σ for a Gaussian.


3.8 In a hospital there is, on average, one twin birth every 3 months. Assuming a 
negative exponential distribution, determine (a) the probability of not having twin 
births for at least 8 months and (b) the probability of not having twin births within 
a month if 8 months have passed without twin births. 


3.9 The sum of the squares of 10 standard variables is 7. Find the probability to be 
wrong by stating that the variables are not Gaussian and independent. 


3.10 Historical data shows 8500 deaths per year from traffic accidents. During the 
year following the introduction of seat belts, the deaths drop to 8100. Considering 
the historical data as the true average value and the annual data as a random variable, 
evaluate whether seat belts have significantly decreased the number of deaths at the 
1% level. 


3.11 If the average frequency of a Poissonian process is 100 events per second, 
calculate the fraction of time intervals between two events less than 1 millisecond.




3.12 A certain amount of radioactive substance emits a particle every 2 s. During a 
test performed on a sample, there were no counts for 10 s and it is concluded that 
this substance is absent. What is the probability that the conclusion is wrong? 


3.13 Prove that, if Poisson’s law (3.47) holds, the counts in disjoint time intervals 
are independent random variables. 


3.14 Find the interval, centred on the mean, that contains with 50% probability the 
values of a standard Gaussian variable. 


3.15 Find the mean and standard deviation of a Gaussian, knowing that the 
probability of obtaining values greater than 4.41 is 21% and that of obtaining values 
greater than 6.66 is 6%. 


3.16 100 events were recorded in 5 days from a Poissonian process. Calculate how 
many days, on average, must pass before recording four events in 1 h.


3.17 Using Theorem 3.5, find an algorithm to generate random variables from the 
density p(x) = 2x - 2, 1 ≤ x ≤ 2.


3.18 A sample of Gaussian-distributed electrical resistances has a mean value of
100 Ω and a standard deviation of 5 Ω.

(a) What is the probability that a resistance value deviates by more than 10% from
the expected value? (b) What is the probability that 10 resistances in series have a
value > 1050 Ω? (c) What upper limit can be derived, for case (a), abandoning the
Gaussian assumption?


3.19 In a rare decay process, a counter can record from 0 to 3 counts with the 
following probabilities: 


x 0 1 2 3 
probability 0.1 0.4 0.4 0.1 


Considering this to be the true distribution, and using the Gaussian approximation, 
calculate the probability that in a month (30 days) a total number of counts greater 
than 80 is recorded, assuming independent daily counts. 


3.20 An electronic company manufactures a particular component with a 5% 
percentage of defective parts. After having sold several batches of 200 items, the 
company declares a maximum of 15 defective pieces. Find the percentage of batches 
that do not meet the sales specification. 


Chapter 4
Multivariate Probability Theory


The non-mathematician is seized by a mysterious shuddering 
when he hears of four-dimensional things, by a feeling not 
unlike that awakened by thoughts of the occult. And yet there is 
no more common-place statement than that the world in which 
we live is a four-dimensional space-time continuum. 


Albert Einstein, “RELATIVITY, THE SPECIAL AND THE 
GENERAL THEORY; A POPULAR EXPOSITION”. 


4.1 Introduction 


We will now address topics a little more complex than those of the previous chapter. 
However, this effort will be repaid by the results we will obtain, which are necessary 
for understanding and dealing with problems involving several random variables. In 
order not to unnecessarily complicate the mathematical formalism, the problems 
will be discussed initially for the case of two random variables. Then, since the 
hypothesis of only two variables will never enter into the proofs of theorems and 
into the discussion, the obtained results will easily be extended to any number 
of variables. They will also be mainly presented in integral form, considering 
continuous variables. The transition to the case of discrete variables is immediately 
obtained with transformations like: 


$$\int_a^b (\ldots)\, p(x)\, dx \ \to\ \sum_{x=a}^{b} (\ldots)\, p(x) , \qquad (4.1)$$


where (...) denotes any expression containing random variables and parameters. 
Equation (4.1) is immediately extendable also to the case of several variables. 


© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
A. Rotondi et al., Probability, Statistics and Simulation, La Matematica per il 3+2
139, https://doi.org/10.1007/978-3-031-09429-3_4


4.2 Multivariate Statistical Distributions 


If X and Y are two random variables defined on the same probability space, we
define as A the event where X is within two values $x_1$ and $x_2$, $\{x_1 \le X \le x_2\}$, and
similarly the event B as $\{y_1 \le Y \le y_2\}$. We write the compound probability P(AB)
as:

$$P(AB) = P\{x_1 \le X \le x_2,\ y_1 \le Y \le y_2\} , \qquad (4.2)$$

which is the probability that the value (x, y) falls into $[x_1, x_2] \times [y_1, y_2]$. A function
p(x, y) ≥ 0 such that:

$$P\{x_1 \le X \le x_2,\ y_1 \le Y \le y_2\} = \int_{x_1}^{x_2} \int_{y_1}^{y_2} p(x, y)\, dx\, dy \qquad (4.3)$$

is called joint probability density of the two variables. It is the bidimensional
extension of the p.d.f. defined in Sect. 2.7.

More generally, if $A \subset \mathbb{R}^2$, it can be shown that if p(x, y) satisfies Eq. (4.3), the
probability that (X, Y) ∈ A (e.g. x + y ≤ a, where a is a constant) is given by [PUP02]:

$$P\{(X, Y) \in A\} = \iint_A p(x, y)\, dx\, dy , \qquad (4.4)$$

which becomes equivalent to the integral (3.94) and to the cumulative or distribution
function (2.33) when:

$$P\{(X, Y) \in A\} = P\{-\infty < X \le a,\ -\infty < Y \le b\} .$$


Both the normalization condition and the variable means and variances are obtained 
as the immediate generalization of the corresponding one-dimensional formulae: 


$$\begin{aligned}
\iint p(x, y)\, dx\, dy &= 1 , \\
\langle X \rangle = \iint x\, p(x, y)\, dx\, dy &= \mu_x , \\
\langle Y \rangle = \iint y\, p(x, y)\, dx\, dy &= \mu_y , \\
\mathrm{Var}[X] = \iint (x - \mu_x)^2\, p(x, y)\, dx\, dy &= \sigma_x^2 , \\
\mathrm{Var}[Y] = \iint (y - \mu_y)^2\, p(x, y)\, dx\, dy &= \sigma_y^2 ,
\end{aligned} \qquad (4.5)$$

where the integration is extended to (-∞, +∞), that is, to the whole range of
existence of the density function. As in the one-dimensional case (see Definition 2.10),
the absolute integrability or summability is required for the existence of mean
values.




In the following, multiple integration will be indicated simply with the single
integral symbol. According to Eq. (2.9), if X and Y are stochastically independent,
the compound probability theorem and Eqs. (1.21, 1.24, 2.9) allow us to write, for
each pair $(x_1, x_2)$ and $(y_1, y_2)$:

$$P\{x_1 \le X \le x_2,\ y_1 \le Y \le y_2\} = P\{x_1 \le X \le x_2\}\, P\{y_1 \le Y \le y_2\} ,$$

that is:

$$\int_{x_1}^{x_2} \int_{y_1}^{y_2} p(x, y)\, dx\, dy = \int_{x_1}^{x_2} p_X(x)\, dx \int_{y_1}^{y_2} p_Y(y)\, dy .$$

Therefore, the joint density can be defined as:

$$p(x, y) = p_X(x)\, p_Y(y) \qquad (\text{if } X \text{ and } Y \text{ are independent}) , \qquad (4.6)$$

where $p_X(x)$ and $p_Y(y)$ are the p.d.f. of the variables X and Y.
It is also possible to define means and variances of combinations of variables. 
For example: 
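\[
\begin{aligned}
&\langle XY \rangle = \int xy\, p(x, y)\, dx\, dy = \mu_{xy}, \\
&\langle X + Y \rangle = \int (x + y)\, p(x, y)\, dx\, dy = \mu_{x+y}, \\
&\mathrm{Var}[XY] = \int (xy - \mu_{xy})^2\, p(x, y)\, dx\, dy, \\
&\mathrm{Var}[X + Y] = \int (x + y - \mu_{x+y})^2\, p(x, y)\, dx\, dy.
\end{aligned}
\tag{4.7}
\]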


It is easy to show, from the second and third of Eqs. (4.5) and from the second of 
Eqs. (4.7), that: 


\[ \mu_{x+y} = \mu_x + \mu_y, \tag{4.8} \]


and that, if X and Y are stochastically independent, from the first of Eqs. (4.7) and from Eq. (4.6) it follows:

\[ \langle XY \rangle = \langle X \rangle \langle Y \rangle. \tag{4.9} \]


So far, there does not seem to be anything new, other than the obvious generalization of one-dimensional formulae. However, multidimensional distributions immediately hold some surprises, because at least three important new quantities can be defined: marginal density, conditional density and covariance.


Definition 4.1 (Marginal Density) If X and Y are two random variables with 
density p(x, y) and A is an interval on the real axis, the marginal density $p_X(x)$ is defined as:

\[ P\{X \in A\} = \int_A p_X(x)\, dx = \int_A dx \int_{-\infty}^{+\infty} p(x, y)\, dy, \tag{4.10} \]

from which:

\[ p_X(x) = \int_{-\infty}^{+\infty} p(x, y)\, dy. \tag{4.11} \]


The marginal density py(y) of y is obtained from the previous equation by 
substituting x with y. 


It is easy to check that marginal densities are normalized: 


\[ \int p_X(x)\, dx = \int p(x, y)\, dx\, dy = 1. \]

These marginal densities give the probability of events of the type $\{X \in A\}$ for any value of Y (and vice versa); they therefore represent the one-dimensional probability densities of the variables X and Y. The following theorem is useful to establish a very important property, i.e. whether two or more variables are independent.


Theorem 4.1 (Independence of Variables) Two random variables (X, Y), of joint density p(x, y), are stochastically independent if and only if there are two functions g(x) and h(y) such that, for each $x, y \in \mathbb{R}$, we have:

\[ p(x, y) = g(x)\, h(y). \tag{4.12} \]


Proof If (X, Y) are independent, Eq. (4.6) proves the first part of the theorem with $g(x) = p_X(x)$ and $h(y) = p_Y(y)$. If, instead, Eq. (4.12) holds, one has in general:

\[ \int_{-\infty}^{+\infty} g(x)\, dx = G, \qquad \int_{-\infty}^{+\infty} h(y)\, dy = H. \]

Since p(x, y) is a normalized p.d.f., from the first of Eqs. (4.5) it follows (the integration limits are implicit):

\[ HG = \int g(x)\, dx \int h(y)\, dy = \int p(x, y)\, dx\, dy = 1. \]

It is then possible to define two normalized marginal densities:

\[ p_X(x) = \int g(x)\, h(y)\, dy = H g(x), \qquad p_Y(y) = \int g(x)\, h(y)\, dx = G h(y), \]

from which, since HG = 1, Eq. (4.6) is obtained:

\[ p(x, y) = g(x)\, h(y) = g(x)\, h(y)\, HG = p_X(x)\, p_Y(y). \qquad \square \]


At a fixed value $x_0$, the function $p(x_0, y)$ should represent a one-dimensional p.d.f. of the variable Y. However, since $p(x_0, y)$ is not normalized, keeping in mind Eq. (4.11), we arrive at the following definition.


Definition 4.2 (Conditional Density) If X and Y are two random variables with joint density p(x, y), the conditional density $p(y|x_0)$ of y for every fixed $x = x_0$ such that $p_X(x_0) > 0$, is given by:

\[ p(y|x_0) = \frac{p(x_0, y)}{p_X(x_0)} = \frac{p(x_0, y)}{\int_{-\infty}^{+\infty} p(x_0, y)\, dy}. \tag{4.13} \]


Also in this case, the conditional density p(x|y) of x with respect to y is obtained 
by exchanging the two variables in the previous formula. 


The conditional densities just defined are normalized. Indeed:

\[ \int p(y|x_0)\, dy = \frac{\int p(x_0, y)\, dy}{\int p(x_0, y)\, dy} = 1. \]


It is important to note that the conditional density p(y|x) is only a function of y, since x is a fixed value (parameter). The same goes for p(x|y) by swapping the two variables. Therefore, it is wrong to write:

\[ P\{Y \in B \,|\, X \in A\} = \int_B\!\int_A p(y|x)\, dx\, dy = \int_B\!\int_A \frac{p(x, y)}{p_X(x)}\, dx\, dy \quad \text{(wrong!)}, \]

since p(y|x) is a function of y only. To find the correct formula, it is necessary to refer to the definition of conditional probability given by Eq. (1.19):

\[ P\{Y \in B \,|\, X \in A\} = \frac{P\{X \in A,\, Y \in B\}}{P\{X \in A\}} = \frac{\int_A \int_B p(x, y)\, dx\, dy}{\int_A dx \int_{-\infty}^{+\infty} p(x, y)\, dy}. \]

The conditional mean and variance operators can be defined for a given fixed value. For example, the expected value of Y conditional on a value x is given by:

\[ \langle Y|x \rangle = \int_{-\infty}^{+\infty} y\, p(y|x)\, dy = \frac{\int_{-\infty}^{+\infty} y\, p(x, y)\, dy}{p_X(x)}, \tag{4.14} \]

where $p_X(x) = \int p(x, y)\, dy$ is constant because x is fixed.




The marginal and conditional densities descend from the compound probability 
Theorem 1.2. In fact, by inverting Eq. (4.13), we can write: 


\[ p(x, y) = p_Y(y)\, p(x|y) = p_X(x)\, p(y|x), \tag{4.15} \]


which corresponds to the compound probability formula of Eq. (1.21) for continuous 
variables, i.e. the density of (X, Y) is given by the density of Y times the density of 
X for each fixed Y (or vice versa). When the variables are independent, from Eqs. (4.6) and (4.13) one obtains:

\[ p(x|y) = p_X(x), \qquad p(y|x) = p_Y(y) \quad \text{(independent X and Y)}. \tag{4.16} \]
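The relations between joint, marginal and conditional densities can be tried out numerically on a small discrete example, where the integrals become sums. The following is only a sketch: the 3 x 2 probability table `p` is a hypothetical choice made for illustration and does not come from the text.

```r
# Hypothetical joint probability table p(x, y): rows index x, columns index y
p <- matrix(c(0.10, 0.20,
              0.15, 0.25,
              0.05, 0.25), nrow = 3, byrow = TRUE,
            dimnames = list(x = c("1", "2", "3"), y = c("0", "1")))

pX <- rowSums(p)      # marginal of X: sum over y, discrete analogue of Eq. (4.11)
pY <- colSums(p)      # marginal of Y: sum over x

# conditional distribution of Y for the fixed value x0 = "2", cf. Eq. (4.13)
pY_given_x0 <- p["2", ] / pX["2"]

# compound probability formula (4.15): p(x, y) = pX(x) p(y|x)
p_cond    <- p / pX                       # row i divided by pX[i]
p_rebuilt <- sweep(p_cond, 1, pX, "*")    # multiply each row back by pX[i]

# Theorem 4.1: the variables are independent only if p factorizes
is_independent <- isTRUE(all.equal(unname(p), unname(outer(pX, pY))))
```

With this particular table the factorization fails, so X and Y are dependent; replacing `p` with `outer(pX, pY)` would give an independent pair with the same marginals.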


The difference between marginal and conditional densities can be well understood 
with the help of Fig. 4.1; the marginal density is obtained by simply projecting the 
function in one dimension, while the conditional density is the projection of the 
function on the y axis (i.e. the y p.d.f.) for a given x value (or vice versa). We also 
note that means and variances defined in (4.5) can be expressed using the marginal 
densities: 


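\[ \mu_x = \int x\, p_X(x)\, dx, \qquad \mu_y = \int y\, p_Y(y)\, dy, \tag{4.17} \]

\[ \sigma_x^2 = \int (x - \mu_x)^2\, p_X(x)\, dx, \qquad \sigma_y^2 = \int (y - \mu_y)^2\, p_Y(y)\, dy. \tag{4.18} \]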


Fig. 4.1 The marginal density $p_Y(y)$ of y is the projection of the density of the (x, y) points on the y axis for any x value. Instead, the function $p(x_0, y)$ is the projection of the density on the y axis for a fixed value $x_0$ (dashed line). This function, after normalization with Eq. (4.13), is the conditional density of Y for a selected value $x_0$




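Let us now examine the fourth of Eqs. (4.7). Since, from Eq. (4.8), the mean of a sum is equal to the sum of the means, we obtain:

\[
\begin{aligned}
\mathrm{Var}[X + Y] &= \int \left[(x - \mu_x) + (y - \mu_y)\right]^2 p(x, y)\, dx\, dy \\
&= \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\, \mathrm{Cov}[X, Y],
\end{aligned}
\tag{4.19}
\]

where:

\[ \mathrm{Cov}[X, Y] = \int (x - \mu_x)(y - \mu_y)\, p(x, y)\, dx\, dy. \]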


Therefore, we introduce the following definition. 


Definition 4.3 (Covariance of Two Variables) The covariance of two random 
variables X and Y is defined as: 


\[ \mathrm{Cov}[X, Y] = \iint (x - \mu_x)(y - \mu_y)\, p(x, y)\, dx\, dy \equiv \sigma_{xy}. \tag{4.20} \]


This parameter is a sort of “crossed” definition between the two possible variances $\sigma_x^2$ and $\sigma_y^2$ of Eqs. (4.5). It is a quadratic quantity, having as dimension the product of the X and Y dimensions. As an exercise, let us transcribe the general Definition 4.3 for the various possible cases, similarly to what has been done for the variance in Eqs. (2.38, 2.42, 2.67). In the case of a discrete density $p_{XY}$, covariance becomes:

\[ \mathrm{Cov}[X, Y] = \sum_x \sum_y (x - \mu_x)(y - \mu_y)\, p_{XY}(x, y), \tag{4.21} \]


where the sum is extended to the full support of the two variables. Instead, the direct computation from a dataset must be performed on the sum of all the numerical realizations of the variables:

\[ s_{xy} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y), \tag{4.22} \]

\[ s_{xy} = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - m_x)(y_i - m_y). \tag{4.23} \]


Here the sum must be made only over one index, that is on all the observed $(x_i, y_i)$ pairs. Notice that the sum of the deviations from the sample means must be divided by N − 1. The set of these pairs exhausts, for $N \to \infty$, the set given by the pairs $(x_i, y_k)$, where the indexes i and k select the values of the support of the two variables. This is a formal but important point that must be kept in mind in order to perform correct calculations: when covariance is computed through the density function, the summation is double and must be done on the probabilities of all possible values of the two paired variables; when covariance is calculated from an experimental set of data, the sum is single and must be done on all the pairs obtained in the sampling. The R software provides the function cov(x,y), which calculates the covariance from Eq. (4.23), where x and y are two vectors with the raw experimental data.

Our routine CovarHisto(x, y, matfre) can be used to calculate covariance, using Eq. (4.21), also from histograms. Here x and y are the vectors of the measured values and matfre is the matrix containing the frequencies or the number of events. Finally, we note that we have, in operator notation:

\[ \mathrm{Cov}[X, Y] = \langle (X - \mu_x)(Y - \mu_y) \rangle. \tag{4.24} \]

Also the equation:

\[
\begin{aligned}
\mathrm{Cov}[X, Y] &= \langle (X - \mu_x)(Y - \mu_y) \rangle = \langle XY - \mu_x Y - \mu_y X + \mu_x \mu_y \rangle \\
&= \langle XY \rangle - \mu_x \langle Y \rangle - \mu_y \langle X \rangle + \mu_x \mu_y \\
&= \langle XY \rangle - \mu_x \mu_y,
\end{aligned}
\tag{4.25}
\]

which corresponds to Eq. (2.51), is valid. Here we see that covariance is given by the mean of the product minus the product of the means.
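The difference between the single sum over observed pairs and the double sum over the support can be checked directly in R. The sketch below uses simulated integer data (an arbitrary choice, so that a frequency table makes sense); it is not the authors' CovarHisto routine, only an illustration of Eqs. (4.21), (4.23) and (4.25) with base R functions.

```r
set.seed(1)
# hypothetical data: pairs of integer values
x <- rbinom(500, size = 4, prob = 0.5)
y <- x + rbinom(500, size = 2, prob = 0.5)
N <- length(x)

# single sum over the observed pairs, Eq. (4.23): this is what cov(x, y) computes
s_xy <- sum((x - mean(x)) * (y - mean(y))) / (N - 1)

# double sum over the support, Eq. (4.21), with relative frequencies as probabilities
freq <- table(x, y) / N
xv <- as.numeric(rownames(freq))
yv <- as.numeric(colnames(freq))
mx <- sum(xv * rowSums(freq))
my <- sum(yv * colSums(freq))
c_xy <- sum(outer(xv - mx, yv - my) * freq)

# mean of the product minus product of the means, Eq. (4.25)
c_alt <- mean(x * y) - mean(x) * mean(y)
```

The two frequency-based values `c_xy` and `c_alt` coincide, and they differ from `s_xy` only by the factor (N − 1)/N, which comes from dividing by N instead of N − 1.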

Covariance has particularly interesting properties that will be discussed in the 
next section. 


Exercise 4.1 
An electronic device has an exponential lifetime of mean:

\[ \mu = 1/\lambda = 1000 \text{ hours} \]

(see Eq. 3.49). An instrument composed of two devices in parallel works when at least one of them is operational. How much longer is the average lifetime of the instrument than that of the single device? What is the probability that the instrument and the single device are still working after 2000 h of functioning?


Answer If $t_1$ and $t_2$ are the failure times of the two devices, the instrument will stop working at a time t such that $t = \max(t_1, t_2)$. The probability of a failure of the instrument within the time t is given by the compound probability that both the first and the second device stop working within t. Since the two devices are independent, it is sufficient to integrate over [0, t] the joint lifetime density (4.6), according to Eqs. (4.4, 4.12):

\[ P\{\max(T_1, T_2) \le t\} = P\{T_1 \le t,\, T_2 \le t\} = \int_0^t e_1(t)\, dt \int_0^t e_2(t)\, dt. \]

Since the two exponential distributions $e_1$ and $e_2$ are identical, from Eq. (3.51) one immediately has the cumulative function of the failure times of the instrument:

\[ P\{\max(T_1, T_2) \le t\} \equiv P(t) = \left(1 - e^{-\lambda t}\right)^2. \]


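By differentiating this distribution function, we obtain the corresponding p.d.f. $e_p(t)$. We therefore have:

\[ e_p(t) = 2\lambda\, e^{-\lambda t}\left(1 - e^{-\lambda t}\right), \]

which is not a simple exponential. The average of the failure times:

\[ \langle T \rangle = 2\lambda \int_0^{\infty} t \left(e^{-\lambda t} - e^{-2\lambda t}\right) dt = \frac{2}{\lambda} - \frac{1}{2\lambda} = \frac{3}{2\lambda} = 1500 \text{ hours}, \]

corresponds to a 50% increase in the mean life.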

Now we consider the second part of the problem. According to Eqs. (3.51, 3.52), the probability for the single device to be working after $t_0 = 2000$ hours is given by:

\[ P\{T > t_0\} = 1 - \int_0^{t_0} \lambda e^{-\lambda t}\, dt = e^{-\lambda t_0} = e^{-2} = 0.135 = 13.5\%. \]

For the instrument with the parallel devices, the same probability is given by:

\[ P\{T > t_0\} = 1 - \int_0^{t_0} e_p(t)\, dt = 1 - 2\lambda \int_0^{t_0} \left(e^{-\lambda t} - e^{-2\lambda t}\right) dt = 1 - \left(1 - e^{-2}\right)^2 = 0.252 = 25.2\%. \]

Then, after 2000 h, the improvement in reliability is about 100%. 
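The results of this exercise can also be checked with a short simulation. The following sketch uses only the base R functions rexp and pmax; the number of simulated instruments and the seed are arbitrary choices.

```r
set.seed(2)
lambda <- 1 / 1000            # failure rate in 1/hours
n <- 200000                   # number of simulated instruments (arbitrary)
t1 <- rexp(n, rate = lambda)  # failure times of the first device
t2 <- rexp(n, rate = lambda)  # failure times of the second device
t_inst <- pmax(t1, t2)        # parallel devices: the instrument fails at the later time

mean_inst <- mean(t_inst)         # to be compared with 3/(2*lambda) = 1500 hours
p_single  <- mean(t1 > 2000)      # to be compared with exp(-2)
p_inst    <- mean(t_inst > 2000)  # to be compared with 1 - (1 - exp(-2))^2
```

Within the statistical fluctuations, the three estimates should reproduce the mean lifetime of the instrument and the two survival probabilities computed above.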


4.3 Covariance and Correlation 


The covariance of two random variables has the important property of vanishing when the variables are independent. In fact, in such a condition, the density p(x, y) appearing in Eq. (4.20) can be written, according to Eq. (4.6), as the product $p_X(x)\, p_Y(y)$. Since $p_X(x)$ and $p_Y(y)$ are normalized, one easily obtains:

\[
\begin{aligned}
\mathrm{Cov}[X, Y] &= \int (x - \mu_x)(y - \mu_y)\, p(x, y)\, dx\, dy \\
&= \int (x - \mu_x)\, p_X(x)\, dx \int (y - \mu_y)\, p_Y(y)\, dy \\
&= \left[\int x\, p_X(x)\, dx - \mu_x \int p_X(x)\, dx\right] \left[\int y\, p_Y(y)\, dy - \mu_y \int p_Y(y)\, dy\right] \\
&= (\mu_x - \mu_x)(\mu_y - \mu_y) = 0.
\end{aligned}
\tag{4.26}
\]


From Eq. (4.19) it follows that, for mutually independent X;, the variance of a sum 
is equal to the sum of variances: 


\[ \mathrm{Var}\left[\sum_i X_i\right] = \sum_i \mathrm{Var}[X_i]. \tag{4.27} \]
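Both Eq. (4.19) and Eq. (4.27) can be verified on simulated samples. In the sketch below the distributions and the coefficient 0.5 are arbitrary illustrative choices; the identity for the variance of a sum holds exactly for the sample estimators var and cov, whereas the additivity of variances holds only approximately on a finite sample of independent variables.

```r
set.seed(3)
n <- 10000
x <- rnorm(n, mean = 1, sd = 2)
z <- runif(n)                 # generated independently of x
y <- 0.5 * x + z              # correlated with x by construction

# Eq. (4.19): exact identity also for the sample estimators
lhs <- var(x + y)
rhs <- var(x) + var(y) + 2 * cov(x, y)

# Eq. (4.27): for independent variables the covariance term is small
diff_indep <- var(x + z) - (var(x) + var(z))   # equals 2 * cov(x, z)
```

Here `lhs` and `rhs` agree up to rounding errors, while `diff_indep` fluctuates around zero and shrinks as the sample size grows.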


Covariance is therefore a statistical indicator of correlation: the more it differs from zero, the stronger the correlation between the variables. But how can the maximum degree of correlation be evaluated? And how can this indicator be made independent of the dataset being examined? This difficulty can be overcome thanks to the following theorem:


Theorem 4.2 (of Cauchy-Schwarz) If the variables X and Y have finite variances, then the following inequality holds:

\[ |\mathrm{Cov}[X, Y]| \le \sigma[X]\, \sigma[Y]. \tag{4.28} \]

Proof Equation (4.28) can be written as:

\[ |\langle (X - \mu_x)(Y - \mu_y) \rangle| \le \sqrt{\langle (X - \mu_x)^2 \rangle \langle (Y - \mu_y)^2 \rangle}. \tag{4.29} \]

If the centred variables $X_0 = X - \mu_x$ and $Y_0 = Y - \mu_y$ are considered, Eq. (4.29) becomes:

\[ \langle X_0 Y_0 \rangle^2 \le \langle X_0^2 \rangle \langle Y_0^2 \rangle. \tag{4.30} \]

Let t be any real number and consider the variable $(tX_0 - Y_0)$. By exploiting the linearity properties of the mean, we easily obtain:

\[ 0 \le \langle (tX_0 - Y_0)^2 \rangle = t^2 \langle X_0^2 \rangle - 2t \langle X_0 Y_0 \rangle + \langle Y_0^2 \rangle. \]

This second degree polynomial in t is always non-negative if and only if the discriminant remains $\le 0$. We therefore have:

\[ 4 \langle X_0 Y_0 \rangle^2 - 4 \langle X_0^2 \rangle \langle Y_0^2 \rangle \le 0, \]

which is just Eq. (4.30). Therefore, Eq. (4.28) is verified.
In this equation the equality holds when

\[ \langle (tX_0 - Y_0)^2 \rangle = 0 \;\Longrightarrow\; Y_0 = tX_0 \;\Longrightarrow\; Y = tX - t\mu_x + \mu_y \equiv aX + b, \]

that is, when there is a linear dependence between X and Y. $\square$


The definition of covariance and the Cauchy-Schwarz theorem lead intuitively to define the correlation coefficient between two variables as:

\[ \rho_{xy} \equiv \rho[X, Y] = \frac{\mathrm{Cov}[X, Y]}{\sigma[X]\, \sigma[Y]}. \tag{4.31} \]

This coefficient lies between the limits:

\[ -1 \le \rho_{xy} \le 1; \tag{4.32} \]

it is zero if the variables are independent, positive if they are correlated, i.e. when one of them increases (decreases) as the other variable increases (decreases), and negative (anticorrelation) if an increase of one variable tends to be associated with a decrease of the other one. Finally, the $\rho = \pm 1$ limits occur when a linear relation of the type Y = aX + b exists between the two variables. In this case we basically have a single random variable, written in two mathematically different ways.

In R, the correlation coefficient between two variables can be calculated with the simple command cov(x,y)/sqrt(var(x)*var(y)), or more briefly with cor(x,y), where the vectors x and y contain the values of X and Y.
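A quick check of these two commands, and of the limits of Eq. (4.32), can be sketched as follows; the simulated variables are an arbitrary choice made for illustration.

```r
set.seed(4)
x <- runif(1000)
y <- x + runif(1000)

r_manual <- cov(x, y) / sqrt(var(x) * var(y))   # Eq. (4.31) with sample estimators
r_cor    <- cor(x, y)                            # same quantity, built-in function

# exact linear relations Y = aX + b give the limits of Eq. (4.32)
r_plus  <- cor(x,  3 * x + 2)    # a > 0
r_minus <- cor(x, -3 * x + 2)    # a < 0
```

The two ways of computing the coefficient agree to rounding precision, and the sign of the limit value follows the sign of the slope a.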

We must now discuss a very delicate point: the connection between statistical 
independence, correlation and causality. According to Definition 1.10, statistical 
independence occurs when the probability of the event can be factored into the 
product of single probabilities of each variable, as in Eq. (2.9) for discrete variables 
or in Eq. (4.6) for continuous ones. For correlation, we use the following definition: 




Definition 4.4 (Correlation Between Variables) Two random variables are said 
to be uncorrelated when their correlation coefficient is zero; otherwise, they are said 
to be correlated. 


Obviously, this definition depends on the correlation coefficient that is used in 
data analysis. For now, let us use the correlation coefficient defined in Eq. (4.31); 
in the following (see Eq. (11.10)) we will give the more general definition of this 
coefficient. We also note that here the meaning of correlation is very technical and 
does not necessarily coincide with the meaning that common sense assigns to this 
parameter. Let us analyse this problem in detail. We immediately notice that, if 
variables are correlated (i.e. if the coefficient $\rho_{xy}$ of Eq. (4.31) is different from zero), then there exists some degree of dependence between them. However, the converse is not true, and a simple counterexample is enough to prove this fact. Let us consider X = U + V and Y = U − V, where U and V are independent uniform variables in [0, 1]; X and Y are dependent because Eq. (4.12) does not hold, but their correlation coefficient is zero because of a vanishing covariance. Indeed, since $\langle U + V \rangle = 1$ and $\langle U - V \rangle = 0$:

\[
\begin{aligned}
\mathrm{Cov}[X, Y] &= \langle (X - \langle X \rangle)(Y - \langle Y \rangle) \rangle \\
&= \langle (U + V - \langle U + V \rangle)(U - V - \langle U - V \rangle) \rangle \\
&= \langle (U + V - 1)(U - V) \rangle = \langle U^2 - V^2 - U + V \rangle \\
&= \langle U^2 \rangle - \langle V^2 \rangle - \langle U \rangle + \langle V \rangle = 0,
\end{aligned}
\]

because U and V have the same distribution. Another key point, as Eq. (4.26) shows, 
is that two statistically independent variables are also uncorrelated. We can then 
summarize our discussion as in the following: 


• A sufficient condition for statistical dependence is the presence of a correlation.
• A necessary condition for statistical independence is the absence of correlation ($\rho_{xy} = 0$).


These properties are also recapped in Fig. 4.2. 
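The counterexample X = U + V, Y = U − V discussed above lends itself to a simple numerical illustration. The sketch below is one possible way to make the dependence visible: the quantity |X − 1| + |Y| and the correlation between absolute values are diagnostics chosen here for illustration, not taken from the text.

```r
set.seed(5)
n <- 50000
U <- runif(n); V <- runif(n)
X <- U + V
Y <- U - V

r_xy <- cor(X, Y)     # close to zero: X and Y are uncorrelated

# dependence: the points (X, Y) fill a square rotated by 45 degrees, so that
# |X - 1| + |Y| <= 1 always holds and the range of Y shrinks as X moves away from 1
max_sum <- max(abs(X - 1) + abs(Y))
r_abs   <- cor(abs(X - 1), abs(Y))    # clearly negative
```

A vanishing `r_xy` together with a sizeable `r_abs` shows that the linear correlation coefficient alone cannot detect this kind of dependence.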

A final important point relates to the connection between statistical correlation 
and causality. The important fact is that a statistical correlation between two 
phenomena does not imply a causation relationship between them. In fact, it is 
extremely easy to find artifacts and so-called spurious correlations: if you explore the web, you will easily find statistical correlations between the number of movies featuring a well-known actor and the suicide trend, or between the increase in pet sales and that of coffee pods, and other similar curiosities. Unfortunately, correlation and causation are often confused, and mistaking a statistical correlation for a cause-effect relationship is one of the most frequent and dangerous errors made by analysts.

A cause-effect link can never be obtained with statistical analysis, but only with 
the use of specific models. 




Fig. 4.2 Link between correlation and statistical dependence (diagram regions: correlated, uncorrelated, dependent, independent). For Gaussian variables, the set of (un)correlated variables coincides with the set of (in)dependent variables


Conversely, the study of statistical dependencies and correlations can instead be 
useful to verify models and theories. The approach must therefore be exactly the 
opposite: if a model or laboratory experiments suggest a cause-effect relationship 
producing a correlation (e.g. between the concentration of carbon dioxide in the 
atmosphere and the global Earth’s temperature), from the statistical data analysis it 
is possible to compare the expected correlation coefficients with the experimental 
ones and falsify or not this model or theory. We will discuss these techniques in 
Chap. 7. 


Exercise 4.2 
If T, U and V ~ U(0, 1), find the linear correlation coefficient between the variable:

\[ X = T \]

and the variables:

\[ Y = U, \qquad Y_1 = X + V, \qquad Y_2 = -X + V. \]
Also check the result with computer-simulated data. 


Answer On the basis of Eq. (3.82), for a uniform variable U the following 
properties hold: 


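\[ \langle U \rangle = 1/2, \qquad \langle U^2 \rangle = 1/3, \qquad \langle (U - 1/2)^2 \rangle = 1/12, \]

whereas for a pair of two uniform independent variables, one has:

\[ \langle UV \rangle = \int_0^1 u\, du \int_0^1 v\, dv = 1/4. \]

From Eqs. (4.24, 4.31) one easily obtains:

\[
\begin{aligned}
\rho[X, Y] &= \frac{\langle (T - 1/2)(U - 1/2) \rangle}{\left[\langle (T - 1/2)^2 \rangle \langle (U - 1/2)^2 \rangle\right]^{1/2}} = 0, \\
\rho[X, Y_1] &= \frac{\langle (T - 1/2)(T + V - 1) \rangle}{\left[\langle (T - 1/2)^2 \rangle \langle (T + V - 1)^2 \rangle\right]^{1/2}} = \frac{1}{\sqrt{2}}, \\
\rho[X, Y_2] &= \frac{\langle (T - 1/2)(-T + V) \rangle}{\left[\langle (T - 1/2)^2 \rangle \langle (-T + V)^2 \rangle\right]^{1/2}} = -\frac{1}{\sqrt{2}}.
\end{aligned}
\]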


The 2D plots of the three variable pairs are shown in Fig. 4.3, where it is possible to see clearly the difference between two uncorrelated variables (X, Y), two positively correlated variables $(X, Y_1)$ and two negatively correlated variables $(X, Y_2)$.

It is possible to test these results with a simulation directly using the runif, cov and var routines given by R. To generate N = 20,000 uniform variables X, Y, Y1, Y2 and obtain their correlation coefficients, just use R in interactive way:

> T <- runif(20000); U <- runif(20000); V <- runif(20000)   # three independent uniform samples
> X=T; Y=U; Y1=X+V; Y2=-X+V                                  # variables of the exercise
> cov(X,Y)/sqrt(var(X)*var(Y))                               # r(x, y)
> cov(X,Y1)/sqrt(var(X)*var(Y1))                             # r(x, y1)
> cov(X,Y2)/sqrt(var(X)*var(Y2))                             # r(x, y2)

The correlation coefficient $\rho[X, Y]$ (and also the coefficients $\rho[X, Y_1]$ and $\rho[X, Y_2]$) are evaluated with Eq. (4.31).
From this simulation, the following values have been obtained:

\[ r(x, y) = 0.007, \qquad r(x, y_1) = 0.712, \qquad r(x, y_2) = -0.705. \]


To estimate whether the difference between these “experimental” values and 
the theoretical ones (0, 0.7071, —0.7071) is significant or not requires the 
concepts that will be developed in Chap. 6. 

A more complete calculation is performed in our routine 
CorrelEst (X,Y), which gives also the errors on covariances and 
correlation coefficients. These last parameters will be evaluated later, in 
Chap. 6. 




Fig. 4.3 2D plots of the pairs of variables considered in Exercise 4.2: no correlation (a), positive correlation (b) and negative correlation (c)


4.4 Two-Dimensional Gaussian Distribution 


Even in the multidimensional case, many problems can be studied, in an exact or approximate way, assuming normal or Gaussian variables. It is therefore essential, at this point, to study the two-dimensional (also called bivariate) Gaussian density. Firstly, we immediately notice that, based on Eqs. (3.28, 4.6), the density g(x, y) of two independent Gaussian variables is given by:

\[ g(x, y) = g_X(x)\, g_Y(y) = \frac{1}{2\pi \sigma_x \sigma_y} \exp\left[-\frac{1}{2}\left(\frac{(x - \mu_x)^2}{\sigma_x^2} + \frac{(y - \mu_y)^2}{\sigma_y^2}\right)\right]. \tag{4.33} \]

With the substitution:

\[ U = \frac{X - \mu_x}{\sigma_x}, \qquad V = \frac{Y - \mu_y}{\sigma_y}, \tag{4.34} \]

the two-dimensional analogue of the standard Gaussian density (3.42) will then be given by:

\[ g(u, v; 0, 1) = \frac{1}{2\pi} \exp\left[-\frac{1}{2}\left(u^2 + v^2\right)\right]. \tag{4.35} \]

However, the density we want to derive must refer to two normal variables that, in general, are mutually dependent and, therefore, with a linear correlation coefficient $\rho_{uv} \equiv \rho \ne 0$. The corresponding standard Gaussian density must therefore satisfy the properties:

• Not to be factorizable into two separate terms depending on u and v only
• To have standard Gaussian marginal distributions
• To satisfy the equation $\sigma_{uv} = \rho$
• To satisfy the normalization requirement

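Let us check if there is a functional form of the type:

\[ g(u, v; 0, 1) = d\, e^{-(a u^2 + b u v + c v^2)} \tag{4.36} \]

which satisfies all these requirements. They correspond to the system of equations:

\[
\begin{aligned}
1 &= d \iint e^{-(a u^2 + b u v + c v^2)}\, du\, dv, \\
1 &= d \iint u^2\, e^{-(a u^2 + b u v + c v^2)}\, du\, dv, \\
1 &= d \iint v^2\, e^{-(a u^2 + b u v + c v^2)}\, du\, dv, \\
\rho &= d \iint u v\, e^{-(a u^2 + b u v + c v^2)}\, du\, dv.
\end{aligned}
\]

These integrals can be solved by the method of Exercise 3.4. The result is a system of four equations in the four unknowns a, b, c, d. Since the detailed procedure, though laborious, is not particularly difficult or instructive, here we report only the final result. If desired, one can easily check afterwards the correctness of the result, which, if $\rho \ne \pm 1$, turns out to be:

\[ a = c = \frac{1}{2(1 - \rho^2)}, \qquad b = -\frac{\rho}{1 - \rho^2}, \qquad d = \frac{1}{2\pi\sqrt{1 - \rho^2}}. \tag{4.37} \]

Equation (4.36) then becomes:

\[ g(u, v; 0, 1) = \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\left[-\frac{1}{2(1 - \rho^2)}\left(u^2 - 2\rho u v + v^2\right)\right]. \tag{4.38} \]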

It is easy to see that this density gives rise, as required, to two standard Gaussian marginal densities for any $\rho \ne \pm 1$. In fact, by adding and subtracting the term $\rho^2 u^2$ in the exponent and integrating in v, one obtains:

\[
\begin{aligned}
g_u(u; 0, 1) &= \frac{1}{2\pi\sqrt{1 - \rho^2}}\, e^{-u^2/2} \int_{-\infty}^{+\infty} \exp\left[-\frac{(v - \rho u)^2}{2(1 - \rho^2)}\right] dv \\
&= \frac{1}{2\pi}\, e^{-u^2/2} \int_{-\infty}^{+\infty} e^{-z^2/2}\, dz = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2},
\end{aligned}
\tag{4.39}
\]

where $z = (v - \rho u)/\sqrt{1 - \rho^2}$, $dv = \sqrt{1 - \rho^2}\, dz$ and the basic integral (3.29) of Exercise 3.4 has been used. The same result is obtained also for the v marginal density.
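The statement that the marginal density is a standard Gaussian for any allowed correlation can be verified numerically. The following sketch integrates the density of Eq. (4.38) over v with the base R function integrate and compares the result with dnorm; the function names `g2` and `marg_u` and the test values are hypothetical choices made here.

```r
# standard bivariate Gaussian density in the form of Eq. (4.38)
g2 <- function(u, v, rho) {
  exp(-(u^2 - 2 * rho * u * v + v^2) / (2 * (1 - rho^2))) /
    (2 * pi * sqrt(1 - rho^2))
}

# marginal density of u obtained by numerical integration over v
marg_u <- function(u, rho) {
  integrate(function(v) g2(u, v, rho), lower = -Inf, upper = Inf)$value
}

rho    <- 0.6                                # arbitrary test value, |rho| < 1
u_test <- c(-1.5, 0, 0.7)                    # arbitrary points where the marginal is evaluated
num    <- sapply(u_test, marg_u, rho = rho)
max_dev <- max(abs(num - dnorm(u_test)))     # deviation from the standard Gaussian density
```

Repeating the check with other values of `rho` between −1 and 1 should give the same marginal, which illustrates that the correlation coefficient does not appear in the projections.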

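Returning from standard to normal variables, from Eq. (4.38) we finally get the density:

\[
\begin{aligned}
g(x, y) &= \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}}\, e^{-\frac{1}{2}\gamma(x, y)}, \\
\gamma(x, y) &= \frac{1}{1 - \rho^2}\left[\frac{(x - \mu_x)^2}{\sigma_x^2} - 2\rho\, \frac{(x - \mu_x)(y - \mu_y)}{\sigma_x \sigma_y} + \frac{(y - \mu_y)^2}{\sigma_y^2}\right],
\end{aligned}
\tag{4.40}
\]

which represents the general form of the two-dimensional Gaussian density.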

At this point we note a remarkable property of Gaussian variables: when the linear correlation coefficient is null, they are independent. In fact, setting $\rho = 0$ in Eq. (4.40), a density corresponding to the product of two Gaussians in x and y is obtained. According to Theorem 4.1, this is the density of two independent random variables. The condition $\rho = 0$, which in general is only necessary for independence, in this case also becomes sufficient. Therefore, we have the following theorem.


Theorem 4.3 (On the Independence of Gaussian Variables) The necessary and 
sufficient condition for the independence of two jointly Gaussian random variables 
is that their linear correlation coefficient is zero. 


We therefore see that the presence of a null covariance (which implies $\rho = 0$) between Gaussian variables ensures the statistical independence between them (see Fig. 4.2).

Reconsidering the marginal distributions (4.39), we note that, passing from standard to normal variables, two one-dimensional Gaussian distributions (3.28) in x and y are obtained, in which the correlation coefficient does not appear. Another important fact comes up here: the knowledge of two marginal distributions, that is, of the projections of the Gaussian on the x and y axes, is not enough for a complete knowledge of the two-dimensional density, because both correlated and uncorrelated variables have Gaussian projections. The knowledge of the covariance or of the correlation coefficients is therefore essential for the complete determination of the statistical distribution of variables.

Fig. 4.4 Two-dimensional Gaussian for uncorrelated variables (a), partially correlated variables (b), totally correlated variables (c). Also the marginal distributions for a given $x_0$ value, the regression line $\langle Y|x \rangle = \mu_y + \rho\, (\sigma_y/\sigma_x)(x - \mu_x)$ and the corresponding dispersion $\sigma[Y|x]$, with $\sigma^2[Y|x] = \sigma_y^2 (1 - \rho^2)$, are shown

Let us now make some further considerations on the shape of the two-dimensional Gaussian. Figure 4.4 reports three cases: uncorrelated, partially correlated and totally correlated variables. If we imagine cutting the curve with planes of constant height, the intersection gives rise to a curve of equation:

\[ u^2 - 2\rho u v + v^2 = \text{constant} \tag{4.41} \]

for standard variables and to a curve:

\[ \frac{(x - \mu_x)^2}{\sigma_x^2} - 2\rho\, \frac{(x - \mu_x)(y - \mu_y)}{\sigma_x \sigma_y} + \frac{(y - \mu_y)^2}{\sigma_y^2} = \text{constant} \tag{4.42} \]

for normal variables. These curves are named concentration ellipses, centred on the point $(\mu_x, \mu_y)$. If $\rho = 0$, the ellipse has the principal axes parallel to the reference axes, and it degenerates to a circle when $\sigma_x = \sigma_y$. Finally, if $\rho = \pm 1$, the variables are completely correlated and the ellipse degenerates into a straight line of equation:

\[ \frac{(x - \mu_x)}{\sigma_x} \mp \frac{(y - \mu_y)}{\sigma_y} = \text{constant}. \tag{4.43} \]



In this case the normal density is completely flattened on a plane, as shown in Fig. 4.4c. We are then dealing with a single random variable (X or Y), which is completely dependent on the other through an analytical equation.

At this point it is necessary to focus on an important property of the two-dimensional Gaussian: when the principal axes of the ellipse are parallel to the reference axes, then $\rho = 0$, the variables are uncorrelated and therefore, by Theorem 4.3, independent. It is then always possible, by operating an axis rotation, to transform a pair of dependent Gaussian variables into independent Gaussian variables. This property, which, as we will see, can be generalized to any number of dimensions, is the basis of the $\chi^2$ test applied to hypothesis testing in statistics. The rotation formula can be found, quite simply, by operating a generic rotation by an angle $\alpha$ of the reference axes on the standard variables (4.34):

\[
\begin{aligned}
A &= U \cos\alpha + V \sin\alpha, \\
B &= -U \sin\alpha + V \cos\alpha.
\end{aligned}
\tag{4.44}
\]

These simple equations can be found in any basic textbook of physics or geometry. To be uncorrelated and independent, the new variables A and B must have their covariance (and, therefore, their correlation coefficient) equal to zero. We then impose:

\[
\begin{aligned}
\langle AB \rangle &= \langle (U \cos\alpha + V \sin\alpha)(-U \sin\alpha + V \cos\alpha) \rangle \\
&= -\sin\alpha \cos\alpha \left(\langle U^2 \rangle - \langle V^2 \rangle\right) + \left(\cos^2\alpha - \sin^2\alpha\right) \langle UV \rangle \\
&= -\frac{1}{2} \sin 2\alpha \left(\langle U^2 \rangle - \langle V^2 \rangle\right) + \cos 2\alpha\, \langle UV \rangle = 0.
\end{aligned}
\]

Since, for standard variables, $\langle U^2 \rangle = \langle V^2 \rangle = \sigma^2 = 1$, one obtains the condition:

\[ \cos 2\alpha\, \langle UV \rangle = 0 \;\Longrightarrow\; \cos 2\alpha = 0 \;\Longrightarrow\; 2\alpha = \pm\frac{\pi}{2} \;\Longrightarrow\; \alpha = \pm\frac{\pi}{4}. \tag{4.45} \]


For this value of $\alpha$, the system of Eqs. (4.44) becomes:

\[ A = \frac{1}{\sqrt{2}}(U + V), \qquad B = \frac{1}{\sqrt{2}}(-U + V). \tag{4.46} \]

Also the inverse equations hold:

\[ U = \frac{1}{\sqrt{2}}(A - B), \qquad V = \frac{1}{\sqrt{2}}(A + B), \tag{4.47} \]

together with the norm conservation in a rotation:

\[ A^2 + B^2 = U^2 + V^2. \tag{4.48} \]

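By substituting Eqs. (4.46-4.48) in Eq. (4.41) we obtain:

\[ U^2 - 2\rho U V + V^2 = (1 - \rho) A^2 + (1 + \rho) B^2. \tag{4.49} \]

With these transformations, the exponent of the standard Gaussian (4.38) assumes, in lowercase notation, the following form:

\[ -\frac{1}{2(1 - \rho^2)}\left(u^2 - 2\rho u v + v^2\right) = -\frac{1}{2}\left[\frac{a^2}{1 + \rho} + \frac{b^2}{1 - \rho}\right]. \]

Since the Jacobian determinant of the transformation¹ (4.47) has the value:

\[ \left|\frac{\partial(u, v)}{\partial(a, b)}\right| = \begin{vmatrix} \partial u/\partial a & \partial u/\partial b \\ \partial v/\partial a & \partial v/\partial b \end{vmatrix} = \begin{vmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{vmatrix} = \frac{1}{2} + \frac{1}{2} = 1, \]

one obtains:

\[
\begin{aligned}
g(u, v; 0, 1)\, du\, dv &= \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\left[-\frac{1}{2(1 - \rho^2)}\left(u^2 - 2\rho u v + v^2\right)\right] du\, dv \\
&= \frac{1}{2\pi} \exp\left[-\frac{1}{2}\left(\frac{a^2}{1 + \rho} + \frac{b^2}{1 - \rho}\right)\right] \frac{da\, db}{\sqrt{1 - \rho^2}} \\
&= \frac{1}{2\pi} \exp\left[-\frac{1}{2}\left(a_\rho^2 + b_\rho^2\right)\right] da_\rho\, db_\rho,
\end{aligned}
\tag{4.50}
\]

where:

\[ a_\rho = \frac{a}{\sqrt{1 + \rho}}, \qquad b_\rho = \frac{b}{\sqrt{1 - \rho}}. \tag{4.51} \]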

Eqs. (4.50, 4.51) contain two important results.

The first one is that, given two Gaussian variables with $\rho \ne \pm 1$, it is always possible to define two new uncorrelated variables (A, B) by passing to the standard variables and rotating by an angle $\alpha = \pm 45°$. This angle corresponds to a rotation that brings the main axes of the concentration ellipse parallel to the reference axes. The double sign of the angle simply indicates that the rotation can consist of bringing an axis of the ellipse parallel either to the x axis or to the y axis of the reference system.
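The effect of the 45° rotation can be illustrated with simulated data. In the sketch below the correlated standard variables are built with the construction V = ρU + √(1 − ρ²)Z, with Z an independent standard Gaussian; this is a standard device assumed here for convenience (it is not taken from the text), and the value of ρ is arbitrary.

```r
set.seed(6)
n   <- 100000
rho <- 0.7                                   # arbitrary correlation, |rho| < 1
U <- rnorm(n)
V <- rho * U + sqrt(1 - rho^2) * rnorm(n)    # standard Gaussian correlated with U

# rotation by 45 degrees of the standard variables
A <- ( U + V) / sqrt(2)
B <- (-U + V) / sqrt(2)

r_uv <- cor(U, V)     # close to rho
r_ab <- cor(A, B)     # close to zero: the rotated variables are uncorrelated
v_a  <- var(A)        # close to 1 + rho
v_b  <- var(B)        # close to 1 - rho

# rescaled variables: both have variance close to one
A_rho <- A / sqrt(1 + rho)
B_rho <- B / sqrt(1 - rho)
```

The variances of A and B differ from one, which is why a further rescaling is needed to obtain standard variables.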


¹ This quantity, which will be formally introduced in Eq. (5.21), gives the variation of the unit area or volume element during the transformation from one reference system to another one.




The second important fact is that, from the uncorrelated variables (A, B), it is possible to obtain two variables $(A_\rho, B_\rho)$ which are independent standard Gaussians. Since the sum of the squares of independent standard Gaussian variables, according to Pearson’s Theorem 3.3, is $\chi^2$ distributed, the variable given by the exponent of the two-dimensional Gaussian (4.40):

\[ Q = \gamma(X, Y), \tag{4.52} \]

when the X, Y correlation coefficient $\rho \ne \pm 1$, follows the $\chi^2(2)$ distribution.

If $\rho = \pm 1$, there is a deterministic relation between A and B and the degrees of freedom decrease to one. We will also show later that the fundamental Eq. (4.52) holds for any number of dimensions. The fact that Eq. (4.52) represents a $\chi^2$ variable does not require the knowledge of the explicit form of the independent Gaussian variables: it is enough to know that they can always be found, if $\rho \ne \pm 1$. Therefore, the $\chi^2$ calculation is usually performed with the original variables of the problem, even if they are correlated.
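The χ²(2) behaviour of Q can be checked with a simulation that works directly on correlated variables. The sketch below is only illustrative: the means, standard deviations and correlation are arbitrary, and the correlated pair is generated with the construction V = ρU + √(1 − ρ²)Z, which is an assumption made here.

```r
set.seed(7)
n <- 100000
mx <- 5; my <- 10; sx <- 2; sy <- 3; rho <- -0.6    # arbitrary parameters
U <- rnorm(n)
V <- rho * U + sqrt(1 - rho^2) * rnorm(n)
X <- mx + sx * U
Y <- my + sy * V

# quadratic form in the exponent of the two-dimensional Gaussian, computed
# with the original correlated variables
u <- (X - mx) / sx
v <- (Y - my) / sy
Q <- (u^2 - 2 * rho * u * v + v^2) / (1 - rho^2)

mean_Q <- mean(Q)                           # a chi-square(2) variable has mean 2
tail_Q <- mean(Q > qchisq(0.95, df = 2))    # expected fraction: 0.05
```

A histogram of `Q` compared with `dchisq(q, df = 2)` would show the same agreement over the whole range.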

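Finally, we analyse Gaussian conditional densities. The density g(y|x), for a fixed x, is easily derived by applying the definition (4.13). In our case this requires dividing the density (4.40) by the marginal density $g_X(x)$, which is nothing more than the one-dimensional Gaussian density $g(x; \mu_x, \sigma_x)$ given by Eq. (3.28). After an easy rearrangement, one obtains:

\[
\begin{aligned}
g(y|x) &= \frac{1}{\sigma_y \sqrt{2\pi(1 - \rho^2)}} \exp\left[-\frac{1}{2(1 - \rho^2)}\left(\frac{y - \mu_y}{\sigma_y} - \rho\, \frac{x - \mu_x}{\sigma_x}\right)^2\right] \\
&= \frac{1}{\sigma_y \sqrt{2\pi(1 - \rho^2)}} \exp\left[-\frac{\left[y - \mu_y - \rho\, \frac{\sigma_y}{\sigma_x}(x - \mu_x)\right]^2}{2\sigma_y^2 (1 - \rho^2)}\right].
\end{aligned}
\tag{4.53}
\]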


Since x is constant, this curve corresponds to a one-dimensional Gaussian with mean and variance given by:

\[ \langle Y|x \rangle = \mu_y + \rho\, \frac{\sigma_y}{\sigma_x}(x - \mu_x), \tag{4.54} \]

\[ \mathrm{Var}[Y|x] = \sigma_y^2 (1 - \rho^2). \tag{4.55} \]

These last formulae show that the mean of Y conditioned on x varies with x along a line, called regression line. On the contrary, the conditional variance of Y remains constant: it does not depend on the value of x and is linked to X only through the correlation coefficient.
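The behaviour of the conditional mean and variance can be observed on simulated data by selecting the pairs whose x value falls in a narrow slice around a chosen $x_0$. The following is a sketch under stated assumptions: the parameters, the slice width and the construction of the correlated pair are arbitrary illustrative choices.

```r
set.seed(8)
n <- 100000
mx <- 5; my <- 10; sx <- 2; sy <- 3; rho <- 0.8     # arbitrary parameters
U <- rnorm(n)
V <- rho * U + sqrt(1 - rho^2) * rnorm(n)
X <- mx + sx * U
Y <- my + sy * V

# select the pairs with x in a narrow slice around x0 and look at Y only
x0  <- 7
sel <- abs(X - x0) < 0.05
cond_mean <- mean(Y[sel])
cond_var  <- var(Y[sel])

# predictions for the conditional mean (regression line) and variance
pred_mean <- my + rho * sy / sx * (x0 - mx)
pred_var  <- sy^2 * (1 - rho^2)
```

Changing `x0` moves `cond_mean` along the regression line, while `cond_var` stays, within fluctuations, at the same value, smaller than the projected variance `sy^2`.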

An experimenter who conducts a series of measurements, sampling $Y$ for different
values of $x$ kept constant, observes a mean moving along the line (4.54) and a
constant variance (4.55). The conditional variance $\mathrm{Var}[Y|x]$ measures
the dispersion of the data around the regression line and is never larger than the
projected variance $\sigma_y^2$. Its value depends on the angle that the principal axes of
the concentration ellipse form with the reference axes. If $\rho = 0$ the two sets of
axes are parallel, the regression lines coincide with the axes of the ellipse, there
is no correlation and the two variances are equal; if $\rho = \pm 1$, $\mathrm{Var}[Y|x] = 0$ and
no dispersion is observed along the regression line (see Fig. 4.4). The conditional
density $g(x|y)$, for a predetermined value of $y$, and the corresponding mean,
variance and regression line are obtained from the previous formulae by exchanging
$y$ with $x$.

Is a non-linear correlation possible between Gaussian variables? The
calculations just carried out show that, if the (projected) marginal densities are both
Gaussian, the relation must be linear. A non-linear relation distorts the Gaussian
form on at least one of the two axes. Nonlinear dependence
will be addressed in detail later, in Chap. 11. Finally, we note that the regression
lines are, by construction, the locus of the points of tangency to the ellipse of the
lines parallel to the axes and therefore do not coincide with the principal axes of the
ellipse (see Fig. 4.5).

Fig. 4.5 Difference between regression line and principal axis of the concentration ellipse
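The difference between the two lines can be made concrete with a small sketch. The covariance matrix below is an illustrative assumption; the principal axes are taken as the eigenvectors of that matrix:

```r
V  <- rbind(c(3, -2), c(-2, 3))      # illustrative covariance matrix
sx <- sqrt(V[1, 1]); sy <- sqrt(V[2, 2])
rho <- V[1, 2] / (sx * sy)
# slope of the regression line of Y on x, from Eq. (4.54)
slope_regression <- rho * sy / sx
# principal axes of the concentration ellipse = eigenvectors of V
e <- eigen(V)
major <- e$vectors[, 1]              # eigenvector of the largest eigenvalue
slope_axis <- major[2] / major[1]
c(regression = slope_regression, principal_axis = slope_axis)
```

For this matrix the regression line has slope $-2/3$, while the major axis of the ellipse has slope $-1$.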

In R, there are different ways to study and represent joint distributions. For
example, to generate 1000 pairs of normal variables with means $\mu_x = 5$ and
$\mu_y = 10$ and covariance matrix with $\sigma_x^2 = \sigma_y^2 = 3$ and $\sigma_{xy} = -2$, one can write:

> require(mvtnorm) # load library
> xy <- rmvnorm(1000,c(5,10),sigma=rbind(c(3,-2),c(-2,3)))


To easily obtain two-dimensional density graphs in R, we need to construct two
vectors containing the bins and a frequency matrix. These operations are performed by
our routine HistoBar3D, which can process both raw data and histograms. The
lines of code of HistoBar3D used to create vectors and matrices, starting from a
matrix of raw data consisting of $x_i$, $y_i$ pairs ordered by columns, can be of general
interest and are detailed here:

# create x and y bins in x.bin and y.bin
x.bin <- seq(0.98*(min(xy[,1])), 1.02*(max(xy[,1])), length=nbinsx)
y.bin <- seq(0.98*(min(xy[,2])), 1.02*(max(xy[,2])), length=nbinsy)

# fill x.bin and y.bin cells with the number of events
freq <- as.data.frame(table(findInterval(xy[,1], x.bin),
                            findInterval(xy[,2], y.bin)))
# table() returns the bin indices as factor labels: convert them back to
# numbers through as.character(), so that empty bins do not shift the indices
freq[,1] <- as.numeric(as.character(freq[,1]))
freq[,2] <- as.numeric(as.character(freq[,2]))
freq[,3] <- as.numeric(freq[,3])

# freq2D is the matrix with bin x, bin y, freqxy
freq2D <- matrix(0,nrow=nbinsx,ncol=nbinsy)
freq2D[cbind(freq[,1], freq[,2])] <- freq[,3]

# marginal distributions
xmarg <- apply(freq2D,1,sum) # row sum (x contents)
ymarg <- apply(freq2D,2,sum) # column sum (y contents)


The best way to understand these complex instructions is to open the window 
of R, generate xy, for example, with rmvnorm, and then interactively study how 
the above-described freq2D matrix is built up. If the input raw data are contained 
in two separate vectors x and y, HistoBar3D creates the two-column matrix xy 
with the instruction: 


xy <- matrix(c(x,y),ncol=2,byrow=FALSE)


For further details, you can directly examine the HistoBar3D routine. Notice also 
that this code uses the R routine bkde2D, which requires raw data and is described 
in Appendix B. 


Exercise 4.3 
There are two Gaussian variables, $X$ and $Y$, where $X$ has mean $\mu_x = 25$ and
standard deviation $\sigma_x = 6$ and

$$Y = 10 + X + Y_R, \tag{4.56}$$

while $Y_R$ is normal with parameters $\mu_{Y_R} = 0$ and $\sigma_{Y_R} = 6$.


Find covariance and correlation coefficient between these two variables. 
Check the obtained results with simulated data. 


Answer The mean and standard deviation of $Y$ are given by:

$$\mu_y = 10 + \mu_x + \mu_{Y_R} = 35, \qquad \sigma_y = \sqrt{\sigma_x^2 + \sigma_{Y_R}^2} = \sqrt{72} = 8.48 .$$

By defining $\Delta X = X - \mu_x$, $\Delta Y = Y - \mu_y$, from Eq. (4.24) the covariance
between variables can be calculated as:

$$\mathrm{Cov}[X, Y] = \langle \Delta X\, \Delta Y\rangle = \langle \Delta X(\Delta X + \Delta Y_R)\rangle = \langle \Delta^2 X\rangle = \sigma_x^2 = 36, \tag{4.57}$$

where the condition $\langle \Delta X\, \Delta Y_R\rangle = 0$ has been used, since the variables $X$ and
$Y_R$ are uncorrelated by construction. The correlation coefficient is given by:

$$\rho = \frac{\mathrm{Cov}[X, Y]}{\sigma[X]\,\sigma[Y]} = \frac{36}{6 \cdot 8.48} = 0.707 .$$

Notice that Eq. (4.55) gives: $\mathrm{Var}[Y|x] = \sigma_y^2(1 - \rho^2) = 8.48^2(1 - 0.707^2) =
36 = \sigma_{Y_R}^2$. Now we can check these results with a simulation.
Opening the R console and proceeding interactively, we can write:

> X <- rnorm(20000,mean=25,sd=6)
> Y <- 10+X+rnorm(20000,mean=0,sd=6)  # Y_R has standard deviation 6
> mean(Y)  # the result appears in the console
> mean(X)
> cov(X,Y)
> cov(X,Y)/sqrt(var(X)*var(Y))
> HistoBar3D(X,Y)


The results appear in the R console, while after the call to HistoBar3D the
results of Fig. 4.6 are displayed in the graphics window. Since the correlation
between $X$ and $Y$ is linear, the marginal histograms $g(x)$ and $g(y)$ have a
Gaussian shape, as can be easily noticed from the graphs of Fig. 4.6. The
means and standard deviations of these histograms are an estimate of the
corresponding true parameters of the densities $g_X(x)$ and $g_Y(y)$. They can
be calculated with Eqs. (2.36, 2.39, 4.23). From 20000 simulated pairs, we
have obtained: $m_x = 24.96$, $m_y = 34.95$, $s_x = 6.02$, $s_y = 8.48$, $s_{xy} =
35.73$, $r = 0.700$. These quantities are indicated in Latin letters because they
are sample estimates of the true values (4.17, 4.18).

Since the simulated sample contains a large number of events, all the values
we got from the data coincide, within “a few per thousand”, with the real
ones. The analytical characteristics of this convergence will be considered in
Chap. 6.


4.5 The General Multidimensional Case 


The generalization of the equations above to the case of more than two variables is
straightforward. The joint p.d.f. of $n$ random variables
is given by a non-negative function:

$$p(x_1, x_2, \ldots, x_n), \tag{4.58}$$

such that, if $A \subset \mathbb{R}^n$,

$$P\{(x_1, x_2, \ldots, x_n) \in A\} = \int_A p(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \ldots dx_n .$$

Fig. 4.6 A sample of 20000 events, computer simulated from the two-dimensional Gaussian
population considered in the Exercise 4.3. Two-dimensional histogram of g(x, y), marginal
distributions g(x) and g(y) and top view with the density curves (bottom right)

For independent variables, one has:

$$p(x_1, x_2, \ldots, x_n) = p_1(x_1)\, p_2(x_2) \ldots p_n(x_n) . \tag{4.59}$$


The mean and variance of $X_k$ (with $1 \le k \le n$) are obtained by generalizing
Eq. (4.5):

$$\langle X_k\rangle = \int x_k\, p(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \ldots dx_n = \mu_k , \tag{4.60}$$

$$\mathrm{Var}[X_k] = \int (x_k - \mu_k)^2\, p(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \ldots dx_n .$$

In the case of several variables, the covariance can be calculated with Eq. (4.20) for
any pair of variables $(x_i, x_k)$:

$$\mathrm{Cov}[X_i, X_k] = \int (x_i - \mu_i)(x_k - \mu_k)\, p(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \ldots dx_n . \tag{4.61}$$

Therefore, one has $n(n-1)/2$ different covariances and $n$ variances. They are
gathered in a symmetric matrix $V$, called covariance matrix:

$$V = \begin{pmatrix}
\mathrm{Var}[X_1] & \mathrm{Cov}[X_1, X_2] & \ldots & \mathrm{Cov}[X_1, X_n] \\
\ldots & \mathrm{Var}[X_2] & \ldots & \mathrm{Cov}[X_2, X_n] \\
\ldots & \ldots & \ldots & \ldots \\
\ldots & \ldots & \ldots & \mathrm{Var}[X_n]
\end{pmatrix} \tag{4.62}$$

where the diagonal elements are the $n$ variances and the non-diagonal ones are
the $n(n-1)/2$ covariances. The matrix is symmetric, since $\mathrm{Cov}[X_i, X_k] =
\mathrm{Cov}[X_k, X_i]$; for this reason it is explicitly written only above the diagonal. The
number of different elements of the matrix is:

$$\frac{n(n-1)}{2} + n = \frac{n(n+1)}{2} .$$

Often the correlation coefficients $\rho[X_i, X_k]$, obtained by generalizing Eq. (4.31),
are used. The covariance matrix is then also written as:

$$V = \begin{pmatrix}
\mathrm{Var}[X_1] & \rho[X_1, X_2]\,\sigma[X_1]\sigma[X_2] & \ldots & \rho[X_1, X_n]\,\sigma[X_1]\sigma[X_n] \\
\ldots & \mathrm{Var}[X_2] & \ldots & \rho[X_2, X_n]\,\sigma[X_2]\sigma[X_n] \\
\ldots & \ldots & \ldots & \ldots \\
\ldots & \ldots & \ldots & \mathrm{Var}[X_n]
\end{pmatrix} . \tag{4.63}$$

If variables are uncorrelated, all the off-diagonal terms are zero, and the diagonal
terms coincide with the variances of the individual variables. Sometimes, instead of
the covariance matrix, the correlation matrix is used:

$$C = \begin{pmatrix}
1 & \rho[X_1, X_2] & \ldots & \rho[X_1, X_n] \\
\ldots & 1 & \ldots & \rho[X_2, X_n] \\
\ldots & \ldots & \ldots & \ldots \\
\ldots & \ldots & \ldots & 1
\end{pmatrix} , \tag{4.64}$$

which is still symmetric.


Generalizing Eq. (4.24) in several dimensions, the covariance matrix can be
expressed as:

$$\langle (X - \mu)(X - \mu)^\dagger \rangle = V, \tag{4.65}$$

where $(X - \mu)$ is a column vector and the symbol $\dagger$ represents the transposition
operation. In this way, the product between a column vector and a row vector gives
the symmetric square matrix (4.62).

In R, the calculation of the covariance matrix starting from a set of raw data
collected in a matrix M and sorted by columns can be done with the routine cov(M)
or var(M), while the function cor(M) must be used for the correlation matrix.
Let us open the R console and explore these functions generating 100 triples of
correlated Gaussian variables:

> a <- rnorm(100)
> b <- a + rnorm(100)
> c <- a - 2*b + rnorm(100)
> M <- cbind(a,b,c) # matrix 100 x 3
> cor(M)
          a          b          c
a  1.000000  0.6725680 -0.3832170
b  0.672568  1.0000000 -0.8875424
c -0.383217 -0.8875424  1.0000000
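The link between Eqs. (4.63, 4.64, 4.65) and these routines can be verified with a short sketch. The seed and the name `c3` (used instead of `c`, only to avoid masking the R function `c()`) are choices made for this illustration:

```r
set.seed(3)                          # arbitrary seed
a  <- rnorm(100)
b  <- a + rnorm(100)
c3 <- a - 2 * b + rnorm(100)
M  <- cbind(a, b, c3)                # matrix 100 x 3
V  <- cov(M)
# Eq. (4.65) on a sample: average outer product of the centred rows
# (R uses n - 1 in the denominator)
Mc <- scale(M, center = TRUE, scale = FALSE)
V_manual <- crossprod(Mc) / (nrow(M) - 1)
all.equal(V, V_manual, check.attributes = FALSE)
# Eqs. (4.63)-(4.64): dividing by the standard deviations gives cor(M)
s <- sqrt(diag(V))
C_manual <- V / outer(s, s)
all.equal(C_manual, cor(M), check.attributes = FALSE)
```

The base R function `cov2cor(V)` performs the same conversion from covariance to correlation matrix.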


The covariance matrix is immediately obtained by typing one of the two
equivalent commands var(M) and cov(M). The matrices $V$ and $C$ often appear
in multivariate probability calculus. It can be shown that they have the fundamental
property of being positive semidefinite:

$$x^\dagger V x \ge 0, \tag{4.66}$$

where the equality holds when all the $n$ elements of the vector $x$ are null. Notice that
in the quadratic form of Eq. (4.66) a row vector $x^\dagger$ appears to the left and a column
vector $x$ appears to the right, so as to obtain a number (not a matrix) which, if $V$ is
diagonal, is a sum of squares of the type $\sum_i x_i^2 V_{ii}$. Since $V^\dagger = V$, $(V^{-1})^\dagger = V^{-1}$,
if $x = V^{-1}y$ one obtains the equation:

$$y^\dagger V^{-1} y = y^\dagger V^{-1} V V^{-1} y = x^\dagger V x \ge 0, \tag{4.67}$$

which shows that also the inverse matrix is positive definite. Another useful property
is that, for any positive definite matrix $V$, there exists a matrix $H$ such that:

$$H H^\dagger = V \;\Longrightarrow\; H H^\dagger V^{-1} H = H \;\Longrightarrow\; H^\dagger V^{-1} H = I, \tag{4.68}$$

where $I$ is the unit matrix. All these properties are proved in detail in Cramer’s
classic textbook [Cra51].
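One possible choice of $H$ is the Cholesky factor, which can be used for a quick numerical check of Eq. (4.68). This is only one of many matrices satisfying $HH^\dagger = V$, and the matrix $V$ below is an illustrative assumption:

```r
# illustrative positive definite covariance matrix
V <- rbind(c(4.0, 2.0, 0.6),
           c(2.0, 2.0, 0.5),
           c(0.6, 0.5, 1.0))
# chol() returns an upper-triangular R with t(R) %*% R = V,
# so H = t(R) satisfies H %*% t(H) = V
H <- t(chol(V))
all.equal(H %*% t(H), V)                      # first of Eqs. (4.68)
all.equal(t(H) %*% solve(V) %*% H, diag(3))   # third of Eqs. (4.68)
# positive definiteness of V and of its inverse: all eigenvalues > 0
min(eigen(V)$values) > 0
min(eigen(solve(V))$values) > 0
```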




It is also possible to generalize Eqs. (4.11, 4.13) to obtain different marginal and 
conditional distributions of one or more variables by integrating over the remaining 
ones. This generalization is obvious and is not reported here. 

When the $X$ variables are Gaussian, they have a p.d.f. given by the multivariate
generalization of the two-dimensional Gaussian (4.40):

$$g(x) = \frac{1}{(2\pi)^{n/2}}\, |V|^{-1/2} \exp\left[-\frac{1}{2}(x - \mu)^\dagger V^{-1}(x - \mu)\right], \qquad |V| \equiv \det V \neq 0, \tag{4.69}$$

where $V$ is the covariance matrix of Eq. (4.62).

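The density (4.69) is called multivariate Gaussian and can be obtained by
extending the procedure which led to the bivariate Gaussian. It is easy to
verify that the multivariate Gaussian contains the bivariate distribution as a special
case. In fact, if $\rho \neq \pm 1$:

$$V = \begin{pmatrix} \sigma_x^2 & \rho\,\sigma_x\sigma_y \\ \rho\,\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix}, \qquad
|V| = \sigma_x^2\sigma_y^2(1-\rho^2), \qquad
V^{-1} = \frac{1}{\sigma_x^2\sigma_y^2(1-\rho^2)} \begin{pmatrix} \sigma_y^2 & -\rho\,\sigma_x\sigma_y \\ -\rho\,\sigma_x\sigma_y & \sigma_x^2 \end{pmatrix},$$

and Eq. (4.40) follows.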
If $H$ is a matrix satisfying Eq. (4.68), with the transformation:

$$H Z = X - \mu, \tag{4.70}$$

the semidefinite form appearing in the exponent of the Gaussian becomes:

$$\gamma(X) = (X - \mu)^\dagger V^{-1}(X - \mu) = Z^\dagger H^\dagger V^{-1} H Z = Z^\dagger Z = \sum_{i=1}^{n} Z_i^2, \tag{4.71}$$

where the third of Eqs. (4.68) has been used. Equation (4.70) is therefore the
equivalent of the two-dimensional rotation (4.44). The new variables $Z$ are still Gaussian
because they are linear combinations of Gaussian variables (see Exercise 5.3 in
the next chapter); moreover, it is easy to verify that they have null mean and that,
according to Eqs. (4.65, 4.70) and the third of Eqs. (4.68), are also uncorrelated
standard variables:

$$\langle Z Z^\dagger\rangle = H^{-1}\langle (X - \mu)(X - \mu)^\dagger\rangle (H^{-1})^\dagger = H^{-1} V (H^\dagger)^{-1} = (H^\dagger V^{-1} H)^{-1} = I . \tag{4.72}$$

From Theorem 4.3, it follows that the $Z_i$ variables are mutually independent and
that Eq. (4.71) represents a variable $Q \sim \chi^2(n)$. Therefore, Eq. (4.52) also holds for
the n-dimensional case.
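The transformation (4.70) and the result (4.71, 4.72) can be explored with a simulated sample. In this sketch the mean vector, the covariance matrix, the seed and the use of the Cholesky factor as $H$ are all illustrative assumptions:

```r
set.seed(4)                                   # arbitrary seed
n  <- 20000
mu <- c(1, -2, 0.5)                           # illustrative mean vector
V  <- rbind(c(4.0, 2.0, 0.6),
            c(2.0, 2.0, 0.5),
            c(0.6, 0.5, 1.0))                 # illustrative covariance matrix
H  <- t(chol(V))                              # H %*% t(H) = V
# sample of correlated Gaussian triples, one per row
X  <- t(mu + H %*% matrix(rnorm(3 * n), nrow = 3))
# invert Eq. (4.70): solve H Z = X - mu for every event
Z  <- t(solve(H, t(X) - mu))
round(cov(Z), 2)                              # close to the unit matrix, Eq. (4.72)
# Eq. (4.71): sum of squares of Z equals the quadratic form in X
Q  <- rowSums(Z^2)
all.equal(Q, mahalanobis(X, center = mu, cov = V))
c(mean(Q), 3)                                 # mean of chi-square(3) is 3
mean(Q <= qchisq(0.9, df = 3))                # close to 0.9
```

The base R function `mahalanobis()` returns the squared distance $(x-\mu)^\dagger V^{-1}(x-\mu)$, so it evaluates $Q$ without constructing $H$ explicitly.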


If $|V| = 0$, there is at least one linear relation between the $n$ variables; it is
then necessary to identify a number $r < n$ of linearly independent variables and
modify Eq. (4.69) accordingly. It is useful to recall that two variables
are stochastically independent (or dependent) if Eq. (4.12) holds (or not); instead,
the linear mathematical dependence implies the existence of well-defined constraint
equations among the variables. Two linearly dependent or constrained random
variables correspond, statistically, to a single random variable. If the equation is
linear, the Cauchy-Schwarz theorem 4.2 gives $\rho = \pm 1$. Therefore, when dealing
with random variables described by positive semidefinite quadratic forms of the
type:

$$(X - \mu)^\dagger W (X - \mu),$$

where $X$ are Gaussian variables and $W$ is a symmetric matrix, it is always necessary
to verify if $|W| = 0$. In such condition, it is necessary to reduce the number of
variables by determining, if there are $p$ linear equations among the variables, $(n - p)$
new linearly independent variables. Linear systems theory assures that it is always
possible to determine these new variables in such a way to obtain a positive definite
quadratic form of dimension $(n - p)$, for which Eqs. (4.71, 4.72) still hold. It is
therefore possible to express the quadratic form as a sum of squares of $(n - p)$ independent
standard Gaussian variables $T_i$:

$$(X - \mu)^\dagger W (X - \mu) = \sum_{i=1}^{n-p} T_i^2 . \tag{4.73}$$

At this point, we can generalize Pearson’s Theorem 3.3 as follows:

Theorem 4.4 (Quadratic Forms) A random variable $Q \sim \chi^2(\nu)$ is given by the
sum of the squares of $\nu$ stochastically independent standard Gaussian variables or
by the positive definite quadratic form (4.73) of Gaussian variables. The degrees of
freedom are given in this case by the number of variables minus the number of the
linear relations existing between them.


Multidimensional random variables can be treated as vectors in n-dimensional
spaces, thus exploiting many of the results of the theory of n-dimensional $\mathbb{R}^n$ vector
spaces. Among these, we recall the scalar product of two variables (vectors) $x$ and
$y$:

$$\langle x, y\rangle = \sum_{i=1}^{n} x_i y_i, \tag{4.74}$$

which is simply the generalization of the scalar product between three-dimensional
vectors. In a similar way as two vectors are defined to be orthogonal if their scalar
product is zero, two sub-spaces $A$ and $A^\perp$ are then defined as orthogonal if each
vector of $A$ is orthogonal to each vector of $A^\perp$. The sub-space $A^\perp$ has dimension
equal to $n - k$, if $k$ is the dimension of $A$. Moreover, we recall that each vector can
be written in the form $x = x_1 + x_2$ where $x_1 \in A$ and $x_2 \in A^\perp$. Matrices, considered
as operators, act on the elements of the vector space. The orthogonal projection
operators $P(A): x \to x_1$, which associate to any $x \in \mathbb{R}^n$ its component on $A$, are
very useful. Projectors have many properties which recall those of the projections
of vectors on the three Cartesian axes. For example, $P(A)x$ is the vector of $A$ at the
minimum distance from $x$ and the following equations hold:

$$P(A)P(A) = P(A), \qquad P(A^\perp)P(A) = 0, \qquad I - P(A) = P(A^\perp). \tag{4.75}$$

The first property, called idempotence, is obvious, because $P(A)x = x$ if $x \in
A$, whereas the second one reflects the fact that the projection vector has zero
components in the orthogonal subspace. The third property, where $I$ is the identity
or unit matrix, holds because $P(A)(I - P(A)) = 0$ by idempotence. The simplest
orthogonal projection operator is perhaps the one reducing the non-zero components
of the vector:

$$P(A)x = (x_1, x_2, \ldots, x_k, 0, \ldots, 0), \tag{4.76}$$

$$P(A^\perp)x = (0, 0, \ldots, 0, x_{k+1}, x_{k+2}, \ldots, x_n).$$

Cochran’s theorem is based on this decomposition. We will use it several times in
the following for Gaussian variables.
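The properties (4.75) can be verified numerically for a generic subspace. In this sketch the subspace $A$ is spanned by the columns of a randomly generated matrix (an assumption made for illustration), and the projector is built with the standard formula $A(A^\dagger A)^{-1}A^\dagger$:

```r
set.seed(5)                               # arbitrary seed
n <- 5
A <- matrix(rnorm(n * 2), nrow = n)       # columns span a subspace of dimension k = 2
P  <- A %*% solve(crossprod(A)) %*% t(A)  # orthogonal projector on A
Pp <- diag(n) - P                         # projector on the orthogonal complement
all.equal(P %*% P, P)                     # idempotence
max(abs(Pp %*% P))                        # ~ 0: second of Eqs. (4.75)
# decomposition x = x1 + x2 with x1 in A and x2 orthogonal to A
x  <- rnorm(n)
x1 <- P %*% x
x2 <- Pp %*% x
all.equal(as.vector(x1 + x2), x)
sum(x1 * x2)                              # scalar product (4.74), ~ 0
c(sum(diag(P)), sum(diag(Pp)))            # traces give the dimensions k and n - k
```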


Theorem 4.5 (of Cochran) Let $X \sim N(0, I)$ be an n-dimensional standard Gaussian
random variable and let $A_1, A_2, \ldots, A_k$ be mutually orthogonal vectorial
sub-spaces in $\mathbb{R}^n$. Let $n_i$ be the dimension of $A_i$ and $P(A_i)$ be the orthogonal
projectors on $A_i$. Then, the random variables $P(A_i)X$, $i = 1, 2, \ldots, k$ are statistically
independent and the variable $|P(A_i)X|^2 = \langle P(A_i)X, P(A_i)X\rangle$ follows the $\chi^2(n_i)$
distribution.

Proof In essence, the theorem affirms the statistical independence and stability of
independent Gaussian variables projected onto orthogonal subspaces. We prove the
theorem in the simple case of two subspaces of the type:

$$P(A_1)X = (X_1, \ldots, X_{n_1}, 0, \ldots, 0)$$

$$P(A_2)X = (0, \ldots, 0, X_{n_1+1}, \ldots, X_{n_1+n_2}) .$$

If $Y = P(A_1)(X)$ and $Z = P(A_2)(X)$, we have $\mathrm{Cov}[Y_i, Z_j] = 0 \;\; \forall\, i, j =
1, 2, \ldots$, by construction. Since, by assumption, the variables $X$ are independent
standard Gaussians, from Theorem 4.3 it also results that $Y$ and $Z$ are independent
and that, from Pearson’s Theorem 3.3, the variables:

$$|P(A_1)X|^2 = X_1^2 + \ldots + X_{n_1}^2 \sim \chi^2(n_1),$$

$$|P(A_2)X|^2 = X_{n_1+1}^2 + \ldots + X_{n_1+n_2}^2 \sim \chi^2(n_2)$$

follow the $\chi^2$ distribution. It can be shown that one can always be led back to this
particular case, through orthogonal transformations that lead to subspaces generated
respectively by the first $n_1$ coordinates and by the remaining $n_2$. In addition, the
property $\langle P(A_i)X, P(A_i)X\rangle = X^\dagger B_i X$ holds, where $B_i$ is a positive semidefinite
matrix, so that the theorem can also be formulated in terms of matrices whose rank
sum is equal to the dimension of the space. These matrices can be diagonalized
according to Eq. (4.71). $\square$

An important application of Cochran’s theorem will be described later on, in
Theorem 6.1.
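A simulation gives a feeling for the content of the theorem. As an illustrative choice (not taken from the text), the two orthogonal subspaces are the direction $(1, 1, \ldots, 1)$ and its orthogonal complement; the seed, $n$ and the number of simulated vectors are assumptions:

```r
set.seed(6)                                  # arbitrary seed
n <- 6; nsim <- 20000
one <- rep(1, n)
P1 <- one %*% t(one) / n                     # projector on (1,...,1): dimension 1
P2 <- diag(n) - P1                           # orthogonal complement: dimension n - 1
X  <- matrix(rnorm(n * nsim), nrow = nsim)   # each row is one X ~ N(0, I)
# P1 and P2 are symmetric, so row %*% P gives the projected vector
Q1 <- rowSums((X %*% P1)^2)                  # |P(A1) X|^2
Q2 <- rowSums((X %*% P2)^2)                  # |P(A2) X|^2
c(mean(Q1), mean(Q2))                        # close to 1 and n - 1
cor(Q1, Q2)                                  # close to 0
mean(Q2 <= qchisq(0.9, df = n - 1))          # close to 0.9
```

With this choice `Q2` coincides with $\sum_i (X_i - \bar{X})^2$, the sum of squared deviations from the sample mean, and `Q1` with $n\bar{X}^2$.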


4.6 Multivariate Probability Regions 


We now consider multidimensional probability levels. The two-dimensional analogue
of the integral (3.94) is given by:

$$P\{a \le X \le b,\; c \le Y \le d\} = \int_a^b \int_c^d p(x, y)\, dx\, dy, \tag{4.77}$$

which gives the probability that, in an experiment, the numerical realizations
of the two random variables lie within fixed limits. In this case, the probability
interval turns into a rectangle, as shown in Fig. 4.7a. When the lower bounds
are $a = -\infty$, $c = -\infty$, Eq. (4.77) gives the two-dimensional analogue of the
distribution function (2.33). Equation (4.77) can be extended in more dimensions by
considering a variable $X = (X_1, X_2, \ldots, X_n)$ and a region $D$ in the n-dimensional
space. The probability that an observation gives a vector in the set $D$ is given by:

$$P\{X \in D\} = \int_D p(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \ldots dx_n . \tag{4.78}$$

Fig. 4.7 Region of the probability $P\{|X - \mu_x| \le \sigma_x, |Y - \mu_y| \le \sigma_y\}$ and
corresponding concentration ellipse for uncorrelated normal variables (a); for
standard variables, the rectangle and the ellipse transform to a square and to a
circle of unit half-side or radius centred at the origin, respectively (b)


For independent Gaussian variables, Eq. (4.77) results in the product of two integrals
of one-dimensional Gaussian densities. When the probability intervals of $X$ and $Y$
are symmetrical and centred on their respective means, the probability levels can be
obtained as the product of the functions (3.38):

$$P\{|X - \mu_x| \le a,\; |Y - \mu_y| \le b\} = 2\int_{\mu_x}^{\mu_x + a} \frac{1}{\sqrt{2\pi}\,\sigma_x}\, e^{-\frac{(x-\mu_x)^2}{2\sigma_x^2}}\, dx \;\times\; 2\int_{\mu_y}^{\mu_y + b} \frac{1}{\sqrt{2\pi}\,\sigma_y}\, e^{-\frac{(y-\mu_y)^2}{2\sigma_y^2}}\, dy = [2E(a/\sigma_x)] \cdot [2E(b/\sigma_y)] . \tag{4.79}$$

The probability levels corresponding to the one-dimensional $3\sigma$ law are therefore
obtained as product of the one-dimensional probability levels (3.35). For example,
the probability for both variables to be within $\pm\sigma$, i.e. within one standard deviation,
is given by:

$$P\{|X - \mu_x| \le \sigma_x,\; |Y - \mu_y| \le \sigma_y\} = 0.683 \cdot 0.683 = 0.466 . \tag{4.80}$$
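These products are easily evaluated in R. In this sketch the helper `E` is a hypothetical name for the integral of the standard Gaussian from 0 to $x$, as implied by the limits of integration in Eq. (4.79):

```r
# integral of the standard Gaussian density between 0 and x
E <- function(x) pnorm(x) - 0.5
p1 <- 2 * E(1)          # one-dimensional probability within one sigma
p1                      # about 0.683
p1^2                    # Eq. (4.80), about 0.466
# rectangles with half-sides of 1, 2 and 3 standard deviations
p_rect <- sapply(1:3, function(k) (2 * E(k))^2)
round(p_rect, 3)
```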


Let us now consider the standard bivariate Gaussian for the case of uncorrelated
variables (4.35) written in polar coordinates:

$$u = r\cos\theta, \qquad v = r\sin\theta .$$

Since the surface element $du\, dv$ turns into $r\, dr\, d\theta$ (the Jacobian of the
transformation is $r$), the function (4.35) becomes:

$$g(r, \theta; 0, 1) = \frac{1}{2\pi}\, r\, e^{-\frac{1}{2}r^2}, \qquad 0 \le \theta \le 2\pi, \quad 0 \le r < \infty . \tag{4.81}$$

This equation shows that, in the case of standard variables, the concentration ellipse
of Fig. 4.7a transforms into a circle centred at the origin (see Fig. 4.7b). From
here, it is then easy to derive the probabilities of falling within the concentration
ellipse having the principal axes equal to $k$ times the standard deviations of the two
variables:

$$P\{U^2 + V^2 \le k^2\} = P\left\{\frac{(X - \mu_x)^2}{\sigma_x^2} + \frac{(Y - \mu_y)^2}{\sigma_y^2} \le k^2\right\} = \frac{1}{2\pi}\int_0^{2\pi}\!\!\int_0^k r\, e^{-\frac{1}{2}r^2}\, dr\, d\theta = 1 - e^{-\frac{1}{2}k^2} . \tag{4.82}$$

From this equation, a “$3\sigma$ law” in the plane is obtained:

$$P\left\{\frac{(X - \mu_x)^2}{\sigma_x^2} + \frac{(Y - \mu_y)^2}{\sigma_y^2} \le k^2\right\} = 1 - e^{-\frac{1}{2}k^2} =
\begin{cases} 0.393 & \text{for } k = 1 \\ 0.865 & \text{for } k = 2 \\ 0.989 & \text{for } k = 3 . \end{cases} \tag{4.83}$$

Notice that the rectangular interval $\pm\sigma$ of Eq. (4.80) has a greater probability (0.466
versus 0.393) than that of the corresponding ellipse (4.83), which for $k = 1$ has
the sides of the rectangle as principal axes. This fact is also shown in Fig. 4.7,
where we see that the probability rectangle (or square) contains the ellipse (or circle) of
concentration.
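The values of Eq. (4.83) and the comparison with the rectangle can be reproduced with a few lines. The sketch relies on the fact that $U^2 + V^2$ is a $\chi^2(2)$ variable, so that the base R function `pchisq` gives the same probabilities:

```r
k <- 1:3
p_ellipse <- 1 - exp(-k^2 / 2)          # Eq. (4.83)
p_chisq   <- pchisq(k^2, df = 2)        # same probability from the chi-square(2) law
p_rect    <- (2 * pnorm(k) - 1)^2       # rectangle with half-sides k * sigma
round(rbind(k, p_ellipse, p_chisq, p_rect), 3)
```

For every $k$ the rectangle, which contains the ellipse, has the larger probability.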

Let us now consider, instead of two standard variables $(U, V)$, two Gaussian
variables $(X, Y)$ of zero mean and having the same variance $\sigma^2$. If we substitute
$r \to r/\sigma$ in the integral (4.82) and integrate in $\theta$, we get the function:

$$p(r) = \frac{r}{\sigma^2}\, e^{-\frac{r^2}{2\sigma^2}}, \qquad x^2 + y^2 = r^2, \tag{4.84}$$

which is the p.d.f. of the distance $r$ from the origin for two independent
Gaussian variables $(X, Y)$ of equal variance; its integration in $dr$ gives the
corresponding probabilities. This function is named Rayleigh
density and is the two-dimensional analogue of the Maxwell density (3.75), which
is valid in space.

In R, this function is calculated with the routine rayleigh(scale) of the
library bayesmeta, where scale $= \sigma$, to which the usual prefixes d, p, q,
r must be added to have the density, cumulative, quantile or random values (see
Table B.2).
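Without loading any library, the Rayleigh density can also be checked by direct simulation. In this sketch `ray_pdf` and `ray_cdf` are hypothetical helper names, and the value of $\sigma$, the seed and the test radii are illustrative assumptions:

```r
# Eq. (4.84) and its integral from 0 to r
ray_pdf <- function(r, sigma) r / sigma^2 * exp(-r^2 / (2 * sigma^2))
ray_cdf <- function(r, sigma) 1 - exp(-r^2 / (2 * sigma^2))
set.seed(7)                               # arbitrary seed
sigma <- 2
n <- 100000
# distance from the origin of independent zero-mean Gaussian pairs
r <- sqrt(rnorm(n, 0, sigma)^2 + rnorm(n, 0, sigma)^2)
r0 <- c(1, 2, 4)
emp <- sapply(r0, function(v) mean(r <= v))
rbind(empirical = emp, formula = ray_cdf(r0, sigma))
# normalisation of the density and mean value sigma * sqrt(pi / 2)
norm_int <- integrate(ray_pdf, 0, Inf, sigma = sigma)$value
norm_int
c(mean(r), sigma * sqrt(pi / 2))
```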

You will have noticed that the probability intervals of a certain variable are
different depending on the global number of variables and on the type of domain
$D$ that is considered in Eq. (4.78). Fixing ideas to the situation of two variables
$(X, Y)$, for a certain given probability level (e.g. 68.3%), we can define as many
as five probability intervals: an interval $(\mu_x \pm \sigma_x)$ for 68.3% probability of finding
a value of $X$ for any value of $Y$ and the same by changing $X$ with $Y$ (so that we
have two intervals), or the interval $(\langle Y|x\rangle \pm \sqrt{\mathrm{Var}[Y|x]})$ of Eqs. (4.54, 4.55) for
the 68.3% probability to find a $Y$ value for a fixed value $x_0$ and similarly for $X$
for a fixed value $y_0$ (now we have a total of four intervals). Finally, we can also
define a two-dimensional region $D$ which includes 68.3% of the $(X, Y)$ pairs. The
densities involved in these five cases are the marginal densities $p_X(x)$, $p_Y(y)$; the
conditional densities $p(y|x_0)$, $p(x|y_0)$; and the joint density $p(x, y)$, respectively.
The region $D$ of the plane where the multidimensional estimates are calculated is
often the concentration ellipse:

$$Q = \text{constant} = (X - \mu)^\dagger V^{-1}(X - \mu) \;\Longrightarrow\; \sum_{i=1}^{n} \frac{(X_i - \mu_i)^2}{\sigma_i^2}, \tag{4.85}$$

where the expression to the right of the arrow applies in case of
uncorrelated variables. This region has some important properties which we will
briefly examine. In the case of $n$ Gaussian variables, according to Theorem 4.4,
the variable $Q$ is distributed as $\chi^2(n)$. This allows us to use, in the calculation
of the probability levels associated with the multidimensional concentration ellipse
(4.85), the significance levels of $\chi^2$ density reported in the Tables E.3 and E.4 of
Appendix E. Since the marginal density of each variable is Gaussian, the variable:

$$Q_i = \frac{(X_i - \mu_i)^2}{\sigma_i^2} = T_i^2 \tag{4.86}$$

follows the $\chi^2(1)$ density. The probability levels associated with $\{Q_i = 1, 4, 9\}$
are the 68.3%, 95.4% and 99.7% Gaussian probability levels corresponding to the
$\{\sqrt{Q_i} = 1, 2, 3\}$ values of the $T_i$ variable, since $P\{Q_i \le x^2\} = P\{|T_i| \le x\}$.

Tables E.3 and E.4 give two different ways of calculating the probabilities for the
$\chi^2$ distribution: to obtain the values shown in Table E.4 for a given probability level
$\alpha$, just find the corresponding level $1 - \alpha$ in Table E.3 and multiply the reduced chi
square $\chi^2_R(\nu)$ of the table by the $\nu$ degrees of freedom required by the problem (note,
in fact, that Table E.4 reports the values of the total $\chi^2$, not of the reduced one). The
difference between the probability levels of each single variable and the joint ones
is shown, for the two-dimensional case, in Fig. 4.8, where the value $\chi^2 = 2.30$ is
taken from Table E.4 for $\nu = 2$. Observing the figure, one should pay attention to
an important property, which we will apply later: the projections on the axes of the
curve corresponding to $\chi^2 = 1$ define the $1\sigma$ intervals of the variables, referred
to the respective marginal densities. This property, obvious in one dimension, also
holds for two variables. This can be demonstrated by equating the $\chi^2$ function (4.52)
to 1:

$$Q = \gamma(X, Y) = \frac{1}{1 - \rho^2}\,(u^2 - 2\rho u v + v^2) = 1, \tag{4.87}$$

and evaluating the intersection points of this curve with the regression line of
Eq. (4.54) $v = \rho u$. The obtained result, i.e. $u = \pm 1$, corresponds to an interval
$(\mu_x \pm \sigma_x)$ and the same, obviously, holds for the projection on the $y$ axis. It can be
shown that this property is in general valid also for the n-dimensional case [BR92].
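Both statements can be checked numerically. In this sketch the value of $\rho$ and the parametrization of the curve $\gamma(u, v) = c$ are choices made for illustration; `ellipse_uv` is a hypothetical helper:

```r
# chi-square values enclosing 68.3% probability for 2 and 1 degrees of freedom
q2 <- qchisq(0.683, df = 2)      # about 2.30
q1 <- qchisq(0.683, df = 1)      # about 1.00
c(q2, q1)
# points of the curve gamma(u, v) = c2 in standard variables
ellipse_uv <- function(c2, rho, t) {
  u <- sqrt(c2) * cos(t)
  v <- sqrt(c2) * (rho * cos(t) + sqrt(1 - rho^2) * sin(t))
  cbind(u = u, v = v)
}
rho <- 0.6                        # illustrative correlation coefficient
t   <- seq(0, 2 * pi, length.out = 20001)
e1  <- ellipse_uv(1, rho, t)
# check that the points satisfy Eq. (4.87)
g <- (e1[, "u"]^2 - 2 * rho * e1[, "u"] * e1[, "v"] + e1[, "v"]^2) / (1 - rho^2)
range(g)
# projections on the two axes: both close to [-1, 1]
range(e1[, "u"])
range(e1[, "v"])
```

Calling `ellipse_uv(q2, rho, t)` gives instead the larger curve that contains 68.3% of the pairs.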

Similar to the one-dimensional case, the Gaussian model is often reasonably
valid for a relevant part of multidimensional random phenomena. The use of the
concentration ellipse and the $\chi^2$ density then provides a very powerful tool for
predicting the probability of an experimental result consisting of an n-tuple of values
$(x_1, x_2, \ldots, x_n)$. If, on the other hand, the problem does not allow the use of the
Gaussian model, it is necessary to resort to the solution of the integrals (4.78) by
defining suitable probability regions, usually hypercubic or elliptical, and taking
care of the correlations between variables. Often, to achieve this difficult objective,
the Monte Carlo simulation methods described in Chap. 8 are used.


Fig. 4.8 One- and two-dimensional probability regions. The curve corresponding to the value
$\chi^2 = 2.30$ contains 68.3% of the variable pairs. The projection on the axes of the curve $\chi^2 = 1$
gives the $1\sigma$ probability intervals for each variable


4.7 Multinomial Distribution 


The binomial density (2.29) describes the probability distribution of the variable
$I = (I_1, I_2)$, defined as an experimental histogram with $N$ events split into two bins
with counts $I_1$ and $I_2$, respectively, when the true probabilities of an observation
falling into bins 1 and 2 are, respectively, $p = p_1$ and $(1 - p) = p_2$:

$$P\{I = n_1, n_2\} = b(n_1; N, p) = \frac{N!}{n_1!\,(N - n_1)!}\, p^{n_1}(1 - p)^{N - n_1} = \frac{N!}{n_1!\, n_2!}\, p_1^{n_1} p_2^{n_2} \tag{4.88}$$

with $(p_1 + p_2) = 1$ and $(n_1 + n_2) = N$. This equation leads to an immediate
generalization to the case of a generic histogram with $k$ bins. The resulting
density, called multinomial, represents the probability that the random variable
$I = (I_1, I_2, \ldots, I_k)$ provides an experimental histogram with $n_i$ events in the
$i$-th bin, when all the a priori probabilities $p_i$ are known:

$$P\{I = n_1, n_2, \ldots, n_k\} = b(n; N, p) = \frac{N!}{n_1!\, n_2! \ldots n_k!}\, p_1^{n_1} p_2^{n_2} \ldots p_k^{n_k}, \tag{4.89}$$

$$(p_1 + p_2 + \cdots + p_k) = 1, \qquad (n_1 + n_2 + \cdots + n_k) = N . \tag{4.90}$$


Since each variable, compared to all others, meets the binomial density
requirements, the following relations hold:

$$\langle I_i\rangle = N p_i, \qquad \mathrm{Var}[I_i] = N p_i (1 - p_i) . \tag{4.91}$$

The variables $(I_1, I_2, \ldots, I_k)$ are dependent because of the second of Eqs. (4.90).
Their covariance can be directly calculated with Eq. (4.61) or, much more
simply, by using the law of transformation of the variance. The calculation
will be performed in the next chapter, in Exercise 5.11. The result is the formula
(5.86), which we anticipate here:

$$\mathrm{Cov}[I_i, I_j] = \sigma_{ij} = -N p_i p_j \qquad (i \neq j). \tag{4.92}$$
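Equations (4.91, 4.92) can be compared with simulated histograms generated by the base R routine `rmultinom`. The values of $N$ and $p_i$, the seed and the number of simulated histograms are illustrative assumptions:

```r
set.seed(8)                                   # arbitrary seed
N <- 1000
p <- c(0.2, 0.3, 0.5)                         # illustrative bin probabilities
# rmultinom returns one histogram per column: transpose to one per row
I <- t(rmultinom(20000, size = N, prob = p))
rbind(sample = colMeans(I), theory = N * p)   # first of Eqs. (4.91)
# expected covariance matrix: Eq. (4.92) off the diagonal, Eq. (4.91) on it
V_theory <- -N * outer(p, p)
diag(V_theory) <- N * p * (1 - p)
round(cov(I), 1)
V_theory
```

All the off-diagonal elements are negative, and each row of `V_theory` sums to zero because of the constraint (4.90).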


Covariance is negative, because more events in one bin imply fewer events in the
other ones. In the one-dimensional case, the binomial density rapidly tends to
the Gaussian density. The same property also applies to the multinomial density,
which is a property of great importance for the statistical study of the histograms.
Equation (3.24), which is valid for $Np,\, N(1 - p) > 10$, can then also be written
as:

$$b(n; N, p) \propto \exp\left[-\frac{1}{2}\,\frac{(n - Np)^2}{Np(1 - p)}\right] = \exp\left[-\frac{1}{2}\left(\frac{(n_1 - Np_1)^2}{Np_1} + \frac{(n_2 - Np_2)^2}{Np_2}\right)\right], \tag{4.93}$$

where the last equality is obtained by setting $p = p_1$, $(1 - p) = p_2$, $n = n_1$,
$(N - n) = n_2$. This suggests a symmetric form which can be generalized to the case
of $k$ variables (a sketch of the proof can be found in [Gne76]). We can therefore write
the fundamental result:

$$b(n; N, p) \propto \exp\left[-\frac{1}{2}\sum_{i=1}^{k} \frac{(n_i - Np_i)^2}{Np_i}\right] \qquad (Np_i > 10 \;\; \forall i), \tag{4.94}$$

where the sum must be extended to all the $k$ variables appearing in the sums (4.90).
Therefore, we have approximated the multinomial density with a multidimensional
Gaussian density, where the sum of the squares of the variables appears in the
exponential. However, these variables are correlated through Eq. (4.90) and thus
the degrees of freedom are really only $k - 1$. If we now combine this result with
Theorem 4.4, we arrive at the important conclusion that we can state as follows.


Theorem 4.6 (Pearson’s Sum) Consider a histogram having $k$ bins and representing
a random sample of size $N$. If $p_i$ is the true probability to observe an event in
the $i$-th bin, the variable:

$$Q = \sum_{i=1}^{k} \frac{(I_i - Np_i)^2}{Np_i}, \qquad (I_1 + I_2 + \cdots + I_k) = N, \tag{4.95}$$

where $I_i$ is the number of observed events in the $i$-th bin, for $N \to \infty$ tends to the $\chi^2$
density with $k - 1$ degrees of freedom.

This theorem is the key for using the $\chi^2$ test in statistics and can already be applied with
good approximation if $Np_i > 10$.

We also note that Eq. (4.95) does not give the sum of squares of standard
variables, since the variance of the $I_i$ variables is given by (4.91) and the
variables are correlated. Therefore the theorem non-trivially generalizes the results of
Theorem 3.3, which refers to the sums of squares of independent standard Gaussian
variables.
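The theorem can be illustrated by simulating many histograms and looking at the distribution of the sum (4.95). The bin probabilities, $N$, the seed and the number of histograms are illustrative assumptions, chosen so that $Np_i > 10$ in every bin:

```r
set.seed(9)                                   # arbitrary seed
N <- 500
p <- c(0.1, 0.2, 0.3, 0.4)                    # N * p_i >= 50 in all bins
k <- length(p)
I <- rmultinom(20000, size = N, prob = p)     # k x 20000 matrix, one histogram per column
# Eq. (4.95) evaluated for every simulated histogram
Q <- colSums((I - N * p)^2 / (N * p))
c(mean(Q), k - 1)                             # mean close to the degrees of freedom
# fraction of histograms below the 95% quantile of chi-square(k - 1)
c(mean(Q <= qchisq(0.95, df = k - 1)), 0.95)
```

Comparing `Q` with the quantiles of $\chi^2(k)$ instead of $\chi^2(k-1)$ gives a visibly worse agreement, which shows the effect of the constraint (4.90).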


4.8 Problems 


4.1 If X and Y are two independent uniform random variables, in the intervals 
[0, a] and [0, b] respectively, find the joint density p(x, y). 


4.2 If $X_i$, $i = 1, 2, 3$ are three independent Gaussian variables with mean $\mu_i$
and variance $\sigma_i^2$, find (a) the band containing 90% of values of $X_1$, (b) the ellipse
containing 90% of values of the $(X_1, X_2)$ pair and (c) the ellipsoid containing 90%
of the triples $(X_1, X_2, X_3)$.


4.3 If $X$ and $Y$ are independent random variables, does the equality $\langle Y|x\rangle = \langle Y\rangle$
hold?


4.4 Calculate graphically, in an approximated way, the correlation coefficient from 
the concentration ellipse of Fig. 4.6 (top right). 


4.5 If $Z = aX + b$ and $U = cY + d$ with $ac \neq 0$, calculate $\rho[Z, U]$.


4.6 The height of a homogeneous population is a Gaussian variable equal to $\langle X\rangle \pm
\sigma[X] = 175 \pm 8$ cm for men and $\langle Y\rangle \pm \sigma[Y] = 165 \pm 6$ cm for women. Assuming
there is no correlation, find the percentage of couples with men and women taller
than 180 and 170 cm, respectively.


4.7 In principle, how would you solve the previous problem in the more realistic
case (couples tend to have homogeneous stature) of a correlation coefficient $\rho = 0.5$
between the height of the husband and that of the wife?


4.8 In the independent roll of two dice, the value of the second die is accepted only 
if the number is odd; otherwise, the value of the first die is assumed as the second 
value. By indicating with X and Y the pair of values obtained in each test, find 
the probability density p(x, y), the marginal densities of X and Y, the mean and 
standard deviations of X and Y and their covariance. 


Chapter 5
Functions of Random Variables


Your textbooks fill with triumphs of linear analysis, its failures 
buried so deep that the graves go unmarked and the existence of 
the graves goes unremarked. 


Ian Stewart, “DOES GOD PLAY DICE? THE NEW 
MATHEMATICS OF CHAOS”. 


5.1 Introduction 


Before moving on to statistics we still need to deal with the following problem: if we
consider several random variables X, Y, ... defined on the same probability space
and combine them into an analytic function Z = f(X, Y, ...) (see Eq. (2.8)), we get
a new random variable Z. If we know the joint probability density $p_{XY\ldots}(x, y, \ldots)$
of the original random variables, what is the density $p_Z(z)$ of the Z variable?

To fix ideas, let us consider the case of two variables X and Y. Two probability
densities $p_{XY}$ and $p_Z$ and a function f(x, y) are then involved, according to the
following scheme:


$$X, Y \sim p_{XY}(x, y) \;\Longrightarrow\; Z = f(X, Y) \sim p_Z(z) .$$


The density $p_{XY}$ is known, f(x, y) represents an assigned functional relation (sum,
product, exponential or others), and our goal is to determine the density $p_Z$ of the
new random variable Z. This scheme, which can obviously be extended to any 
number of variables, represents the core of the problem we want to solve. 

Probability densities will always be indicated with the letter p(...), functional
relations with the letter f(...). The random variable Z is defined in terms of the
realizations a and b of the random variables X and Y, according to the scheme of
Fig. 5.1:


$$Z = Z(a, b) = f(X(a), Y(b)) , \qquad a \in A ,\; b \in B .$$


© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
A. Rotondi et al., Probability, Statistics and Simulation, La Matematica per il 3+2
139, https://doi.org/10.1007/978-3-031-09429-3_5




Fig. 5.1 Definition of the random variable Z = f(X, Y)


For discrete variables, the probability that Z belongs to a particular numerical set $R_Z$
is then given, according to Theorem 1.1, by the probability of obtaining $\{X \in R_X\}$
and $\{Y \in R_Y\}$ (i.e. $a \in A$ and $b \in B$) added over all cases satisfying the condition
$Z \in R_Z$:


$$P\{Z \in R_Z\} = \sum_{[a \in A,\, b \in B:\, f(x,y) \in R_Z]} p(a, b) = \sum_{[x \in R_X,\, y \in R_Y:\, f(x,y) \in R_Z]} p_{XY}(x, y) . \tag{5.1}$$


If X and Y are independent, one has, from the compound probability Theorem 1.2:


$$P\{Z \in R_Z\} = \sum_{[x \in R_X,\, y \in R_Y:\, f(x,y) \in R_Z]} p_X(x)\, p_Y(y) . \tag{5.2}$$


Figure 5.1 visually shows the meaning of Eqs. (5.1, 5.2). 

Let us now consider a simple example: let X and Y be the scores obtained by
rolling a die twice ($1 \le X, Y \le 6$), and let Z = f(X, Y) = X + Y be the sum of the
two scores. Define $R_Z$ as the set of the results smaller than 5 (Z = X + Y < 5),
and calculate the probability $P\{Z \in R_Z\}$. If the two trials are independent, Eq. (5.2)
holds and the probability of obtaining a generic pair (x, y) is 1/6 × 1/6 = 1/36.
Eq. (5.1) requires summing the probabilities of the pairs for which Z < 5: (1,1), (1,2),
(1,3), (2,1), (2,2), (3,1). Since there are six pairs, P{Z < 5} = 6/36 = 1/6.
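
The same count can be reproduced by brute-force enumeration. The following lines are only a minimal sketch of Eqs. (5.1, 5.2) for the two dice; the variable names (rolls, p.xy, in.Rz, n.pairs, p.Rz) are our own choice for this illustration:

```r
# All 36 equally likely outcomes of two independent rolls
rolls <- expand.grid(x = 1:6, y = 1:6)
p.xy  <- rep(1/36, nrow(rolls))        # p_X(x) * p_Y(y), as in Eq. (5.2)
in.Rz <- (rolls$x + rolls$y) < 5       # pairs satisfying f(x, y) in R_Z
n.pairs <- sum(in.Rz)                  # number of favourable pairs
p.Rz    <- sum(p.xy[in.Rz])            # sum of Eq. (5.1) over those pairs
print(rolls[in.Rz, ])
print(c(n.pairs = n.pairs, p.Rz = p.Rz))
```

Changing the condition in in.Rz is enough to treat any other set $R_Z$ for the sum of two dice.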

The calculation of densities which are functions of random variables, distributed 
according to the fundamental densities that we have studied so far, is often a 
very complicated task from the analytic point of view. Sometimes calculations 
appear to be unsolvable. However, a great help often comes from simple simulation 
techniques. Suppose, for instance, that we want to determine the distribution of the 




variable $Z = \ln(5 + X) \cdot \sinh(X)$, where $X \sim N(0, 1)$. If we can't find the analytical
solution, we can at least learn the shape of the distribution with this simple code,
which uses the density function described in Appendix B:


> x <- rnorm(10000)
> z <- log(5+x)*sinh(x)
> plot(density(z,adj=0.01))


The curve appearing in the R window shows the behaviour of the solution; we do
not have the analytical form, but we can see its trend and calculate its fundamental
parameters. For instance, we find mean(z) ≈ 0.314 and var(z) ≈ 7.15,
with an uncertainty, to be calculated in the next chapter, which is due to the fact
that we have a sample of 10,000 data and not the parent population. The statistical
uncertainty or error can often be made negligible by increasing the sample of
simulated data while remaining within reasonable calculation times.
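
As a rough illustration of how the statistical error shrinks with the sample size, the sketch below repeats the simulation for three sample sizes. It assumes that the uncertainty on the mean can be estimated as sd(z)/sqrt(n), a formula justified only in the next chapter; the helper name mc, the seed and the guard against x ≤ −5 (where log(5 + x) is undefined) are our own additions:

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
mc <- function(n) {
  x <- rnorm(n)
  x <- x[x > -5]                  # guard: log(5 + x) is undefined for x <= -5
  z <- log(5 + x) * sinh(x)
  # sample mean and its estimated statistical error
  c(mean = mean(z), se = sd(z) / sqrt(length(z)))
}
res <- sapply(c(1e4, 1e5, 1e6), mc)
colnames(res) <- c("n=1e4", "n=1e5", "n=1e6")
print(res)
```

Each tenfold increase of the sample reduces the estimated error by about a factor of three, at the price of a longer (but still short) calculation time.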

In the following, we will proceed by successive steps, first considering one 
variable, and then two variables, and finally indicating how to extend the procedure 
to n variables. We will also show how it is possible, considering only the transfor- 
mation of the mean and the variance through the function f, to obtain a simple and 
general solution, although approximate, of the problem. In the most difficult cases, 
simulation techniques can be used, and they must be part of the basic knowledge of 
any statistician. 


5.2 Functions of a Random Variable


Let X be a continuous random variable with p.d.f. $p_X(x)$, and let Z be a random
variable depending on X through the functional relation: 


Z = f(X). (5.3) 


To determine the density $p_Z(z)$, the known rules on the change of variable inside
a function can be applied. However, attention has to be paid to how probability
intervals are transformed. It is therefore appropriate to use probability integrals
defined by the cumulative functions and use the key Eqs. (2.28, 2.35). In addition, we
will also exploit the Leibniz theorem on the differentiation of an integral, the proof
of which can be found in many calculus books. Given a function z = f(x), if
$x = f^{-1}(z)$ exists and is differentiable in $[x_1, x_2]$, the equation:


$$\frac{d}{dz} \int_{x_1}^{x_2} p(x)\, dx = \left. \frac{d f^{-1}}{dz} \right|_{z = f(x_2)} p(x_2) - \left. \frac{d f^{-1}}{dz} \right|_{z = f(x_1)} p(x_1) = \frac{p(x_2)}{f'(x_2)} - \frac{p(x_1)}{f'(x_1)} , \tag{5.4}$$


holds, where the prime symbol indicates the derivative operation.




Fig. 5.2 The transformation z = f(x)


Let us now consider a generic continuous function f(X), as in Fig. 5.2, and
determine the probability that Z is less than a given value $z_0$. Basically, we
have to find the probability that Z lies below the line $z = z_0$ of Fig. 5.2. From
Eqs. (2.28, 5.1) and from the figure, we obtain:


$$P\{Z \le z_0\} = F_Z(z_0) = \sum_i \int_{\Delta_i} p_X(x)\, dx , \tag{5.5}$$


where $F_Z$ is the cumulative function of Z and the intervals $\Delta_i$ are those of Fig. 5.2.
These intervals, except for the first and the last one, have as extremes the real roots
$x_1, x_2, \ldots, x_n$ of Eq. (5.3):


$$z_0 = f(x_1) = f(x_2) = \cdots = f(x_n) . \tag{5.6}$$


Equation (2.35) shows that, by differentiating Eq. (5.5), the required p.d.f. is found.
From Fig. 5.2 we see that the lower bound of each interval $\Delta_i$ either does not depend on z
(and therefore has a null derivative) or, when it depends on z, always has a negative
derivative (the function decreases). Applying the Leibniz formula (5.4) to Eq. (5.5),
we can therefore write the density $p_Z(z)$ as a sum of positive terms, taking the absolute
value of the derivatives calculated at the real roots (5.6):


$$p_Z(z) = \frac{dF_Z(z)}{dz} = \frac{p_X(x_1)}{|f'(x_1)|} + \frac{p_X(x_2)}{|f'(x_2)|} + \cdots + \frac{p_X(x_n)}{|f'(x_n)|} , \tag{5.7}$$


where the right-hand side is a function of $z_0 = z$ through the inverse of Eq. (5.6).
The result is always positive, as required for a probability density. When there is 
only one real root of Eq. (5.3) and p(x) > 0, this formula coincides with the usual 




method of substituting a variable in an integral. This method has already been tacitly 
applied in the Exercises 3.10 and 3.11 on the Maxwell and Boltzmann distributions. 
Let us now apply the fundamental Eq. (5.7) to some other significant cases. When
Eq. (5.3) is given by:


$$Z = f(X) = aX + b , \tag{5.8}$$


Equation (5.6) allows only one solution:


$$x_1 = \frac{z - b}{a} .$$


Since $f'(x) = a$, from Eq. (5.7) one obtains:


$$p_Z(z) = \frac{1}{|a|}\, p_X\!\left( \frac{z - b}{a} \right) . \tag{5.9}$$


If, on the other hand, we consider the functional link:


$$Z = f(X) = aX^2 , \tag{5.10}$$


Equation (5.6) admits the solutions:


$$x_1 = -\sqrt{\frac{z}{a}} , \qquad x_2 = +\sqrt{\frac{z}{a}} , \tag{5.11}$$


which give rise to the derivatives:


$$|f'(x_1)| = |f'(x_2)| = 2a \sqrt{\frac{z}{a}} = 2\sqrt{az} .$$


These results, inserted in Eq. (5.7), allow the determination of the required p.d.f.:


$$p_Z(z) = \frac{1}{2\sqrt{az}} \left[ p_X\!\left( -\sqrt{\frac{z}{a}} \right) + p_X\!\left( \sqrt{\frac{z}{a}} \right) \right] , \qquad z \ge 0 . \tag{5.12}$$


If the density $p_X(x)$ is the Gaussian of Eq. (3.28) and Eq. (5.8) is applied, Eq. (5.9)
becomes:


$$p_Z(z) = \frac{1}{|a| \sigma \sqrt{2\pi}} \exp\left[ -\frac{[z - (a\mu + b)]^2}{2 a^2 \sigma^2} \right] , \tag{5.13}$$


which is a Gaussian of mean and variance given by:


$$\mu_Z = a\mu + b , \qquad \sigma_Z^2 = a^2 \sigma^2 . \tag{5.14}$$



We note that, since the transformation (5.8) is linear, Eq. (5.14) can also be deduced
directly from Eqs. (2.63, 2.64). In the case of the quadratic transformation (5.10) of
a Gaussian variable with zero mean, Eq. (5.12) becomes:


$$p_Z(z) = \frac{1}{\sigma \sqrt{2\pi a z}} \exp\left( -\frac{z}{2 a \sigma^2} \right) , \qquad z \ge 0 , \tag{5.15}$$


which is the gamma density (3.57) with k = 1/2. Mean and variance can be obtained
from Eq. (3.58), or by integration by parts of Eqs. (2.54, 2.57), and are:


$$\mu_Z = a \sigma^2 , \qquad \sigma_Z^2 = 2 a^2 \sigma^4 . \tag{5.16}$$
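
The quadratic case can be checked numerically. The sketch below is only an illustration under stated assumptions: the values of a and sigma are arbitrary, p.Z is our own helper implementing Eq. (5.15), and the comparison with dgamma assumes that the gamma density with k = 1/2 corresponds to shape 1/2 and rate $1/(2a\sigma^2)$ in the R parametrization:

```r
a <- 2; sigma <- 1.5                     # illustrative values (assumptions)
# Eq. (5.15): density of Z = a X^2 with X ~ N(0, sigma^2)
p.Z <- function(z) exp(-z / (2 * a * sigma^2)) / (sigma * sqrt(2 * pi * a * z))
z <- c(0.1, 0.5, 1, 3, 10)
# the same density written as a gamma density with shape 1/2
p.gamma <- dgamma(z, shape = 0.5, rate = 1 / (2 * a * sigma^2))
print(cbind(z, p.Z = p.Z(z), p.gamma))
# Monte Carlo check of mean and variance, Eq. (5.16)
set.seed(1)                              # arbitrary seed
zs <- a * rnorm(1e5, mean = 0, sd = sigma)^2
print(c(mean = mean(zs), expected = a * sigma^2))
print(c(var = var(zs), expected = 2 * a^2 * sigma^4))
```

The two density columns should coincide, while the simulated mean and variance agree with Eq. (5.16) only within the statistical error of the sample.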


5.3 Functions of Several Random Variables 


We now generalize the results of the previous section to the case of functions of
several random variables. Let us first consider the simple case of a single function Z
of n random variables:


$$Z = f(X_1, X_2, \ldots, X_n) \equiv f(\boldsymbol{X}) . \tag{5.17}$$


The analogue of the cumulative function (5.5) is now given by:


$$P\{Z \le z_0\} = F_Z(z_0) = \int \cdots \int_{\boldsymbol{X} \in D} p_{\boldsymbol{X}}(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \ldots dx_n , \tag{5.18}$$


where $p_{\boldsymbol{X}}(x_1, x_2, \ldots, x_n)$ is the p.d.f. of the n variables and D is the set of the
n-tuples $\boldsymbol{X} = (x_1, x_2, \ldots, x_n)$ such that $Z \le z_0$, according to Eq. (5.1).

It should be noted that the probability density of the variables X appears only as 
an argument of the integral, while the functional link Z = f(X) appears exclusively 
in the determination of the integration domain D. 

In many cases, the differentiation of the cumulative function (5.18) solves the problem of
determining the density $p_Z$. This method is the generalization of the one used in
Sect. 3.8, where we differentiated Eq. (3.61) to obtain the $\chi^2$ density. As an alternative,
to deal with the more general case of n variables Z which are functions of n
parent random variables X, one can use a well-known formula based on the general
theorem on the change of variable in an integral. Since this theorem is
proved in many mathematical analysis texts, here we report only its statement:


Theorem 5.1 (Change of Variable in Density Functions) Let $\boldsymbol{X} = (X_1, X_2, \ldots, X_n)$
be n random variables with joint density $p_{\boldsymbol{X}}(\boldsymbol{x})$, and let $\boldsymbol{Z} = (Z_1, Z_2, \ldots, Z_n)$
be n variables related to $\boldsymbol{X}$ by the n functional relationships:


$$\begin{aligned} Z_1 &= f_1(\boldsymbol{X}) \\ Z_2 &= f_2(\boldsymbol{X}) \\ &\;\;\vdots \\ Z_n &= f_n(\boldsymbol{X}) , \end{aligned} \tag{5.19}$$


which are all invertible and differentiable with continuous derivatives with respect
to all arguments (there is a one-to-one correspondence between the two domains of
$\boldsymbol{X}$ and $\boldsymbol{Z}$, for which $X_1 = f_1^{-1}(\boldsymbol{Z})$, etc.).

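The p.d.f. $p_{\boldsymbol{Z}}$ is then given by:


$$p_{\boldsymbol{Z}}(z_1, z_2, \ldots, z_n) = p_{\boldsymbol{X}}(x_1, x_2, \ldots, x_n)\, |J| = p_{\boldsymbol{X}}\left( f_1^{-1}(\boldsymbol{z}), f_2^{-1}(\boldsymbol{z}), \ldots, f_n^{-1}(\boldsymbol{z}) \right) |J| , \tag{5.20}$$


where |J| is the Jacobian, defined as the absolute value of the determinant:


$$|J| = \left| \det \begin{pmatrix} \partial f_1^{-1}/\partial z_1 & \partial f_1^{-1}/\partial z_2 & \cdots & \partial f_1^{-1}/\partial z_n \\ \partial f_2^{-1}/\partial z_1 & \partial f_2^{-1}/\partial z_2 & \cdots & \partial f_2^{-1}/\partial z_n \\ \vdots & \vdots & & \vdots \\ \partial f_n^{-1}/\partial z_1 & \partial f_n^{-1}/\partial z_2 & \cdots & \partial f_n^{-1}/\partial z_n \end{pmatrix} \right| . \tag{5.21}$$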


Obviously, the transformation is possible if all the derivatives are continuous and
$|J| \neq 0$. When there is not a unique invertible transformation $f_i$ (i = 1, 2, ..., n),
it is necessary to subdivide the domains of $\boldsymbol{X}$ and $\boldsymbol{Z}$ into m disjoint subsets between
which there is a one-to-one correspondence and then to sum Eq. (5.20) over these
domains:


$$p_{\boldsymbol{Z}}(z_1, z_2, \ldots, z_n) = \sum_{L=1}^{m} p_{\boldsymbol{X}}\left( f_{L1}^{-1}(\boldsymbol{z}), f_{L2}^{-1}(\boldsymbol{z}), \ldots, f_{Ln}^{-1}(\boldsymbol{z}) \right) |J_L| . \tag{5.22}$$


The theorem can also be applied to the case of the particular transformation (5.17). 
In fact, let us consider a bivariate Z function: 


Z = f(X1, X2). (5.23) 




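To apply the Jacobian determinant method, we define $Z_1 \equiv Z$ and an auxiliary
variable $Z_2 = X_2$. Equation (5.19) then becomes:


$$Z_1 = f(X_1, X_2) , \qquad Z_2 = X_2 . \tag{5.24}$$


The density of $Z \equiv Z_1$ can then be found by applying Eq. (5.20) and then integrating
over the auxiliary variable $Z_2 = X_2$. Since the Jacobian is:


$$|J| = \left| \det \begin{pmatrix} \partial f_1^{-1}/\partial z_1 & \partial f_1^{-1}/\partial z_2 \\ 0 & 1 \end{pmatrix} \right| = \left| \frac{\partial f_1^{-1}}{\partial z_1} \right| ,$$


from Eq. (5.20) one obtains:


$$p_{\boldsymbol{Z}}(z_1, z_2) = p_{\boldsymbol{X}}(x_1, x_2) \left| \frac{\partial f_1^{-1}}{\partial z_1} \right| . \tag{5.25}$$


The p.d.f. of Z is obtained by integration over the auxiliary variable $Z_2$:


$$p_{Z_1}(z_1) = \int p_{\boldsymbol{Z}}(z_1, z_2)\, dz_2 . \tag{5.26}$$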


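Recalling that $Z_1 \equiv Z$, $X_2 \equiv Z_2$ and that:


$$\frac{\partial f_1^{-1}}{\partial z_1} \equiv \frac{\partial f^{-1}}{\partial z} , \qquad x_1 = f_1^{-1}(z_1, x_2) \equiv f^{-1}(z, x_2) , \qquad p_Z(z) \equiv p_{Z_1}(z_1) ,$$


we can write Eq. (5.26) as:


$$p_Z(z) = \int p_{\boldsymbol{X}}(x_1, x_2) \left| \frac{\partial f^{-1}}{\partial z} \right| dx_2 = \int p_{\boldsymbol{X}}\left( f^{-1}(z, x_2), x_2 \right) \left| \frac{\partial f^{-1}}{\partial z} \right| dx_2 , \tag{5.27}$$


which represents the requested p.d.f. This formula provides an alternative to
Eq. (5.18) to find the density $p_Z(z)$ when the functional relationship is of the type
$Z = f(X_1, X_2)$. If the variables $X_1$ and $X_2$ are independent, the density $p_{\boldsymbol{X}}$
factorizes according to Eq. (4.6), and Eq. (5.27) becomes:


$$p_Z(z) = \int p_{X_1}\left( f^{-1}(z, x_2) \right) p_{X_2}(x_2) \left| \frac{\partial f^{-1}}{\partial z} \right| dx_2 . \tag{5.28}$$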




When the variable Z is given by the sum:


$$Z = X_1 + X_2 , \tag{5.29}$$


the inverse function $f^{-1}$ and its derivative to be inserted in Eq. (5.27) are given by:


$$X_1 = f^{-1}(Z, X_2) = Z - X_2 , \qquad \frac{\partial f^{-1}}{\partial z} = 1 , \tag{5.30}$$


and the following result:


$$p_Z(z) = \int_{-\infty}^{+\infty} p_{\boldsymbol{X}}(z - x_2, x_2)\, dx_2 \tag{5.31}$$


is obtained.
If the two variables $X_1$ and $X_2$ are also independent, then the p.d.f. of Eq. (4.3)
factorizes as in Eq. (4.6), and the previous integral becomes:


$$p_Z(z) = \int_{-\infty}^{+\infty} p_{X_1}(z - x_2)\, p_{X_2}(x_2)\, dx_2 . \tag{5.32}$$


This is called the convolution integral. It is often met, both in statistics and in
experimental physics, during the analysis of laboratory measurements. In these
cases, $p_Z$ is an observed random signal (for instance, an image), $p_{X_1}$ is the
true signal (the true image) and $p_{X_2}$ is a blurring or apparatus function. In such
conditions, it is necessary to determine $p_{X_1}$ when $p_Z$ is observed and $p_{X_2}$ is known.
This is achieved by integral inversion using deconvolution algorithms. You can
easily imagine the importance and the widespread use of these techniques, from
medical diagnostics to astrophysics. We will return to this issue in Sect. 12.15.

After so much mathematics, we also note that this last integral has a simple
intuitive explanation: the probability of observing a value Z = z is given by
the probability of obtaining a value $x_2$ times the probability of having a value
$x_1 = z - x_2$, so as to satisfy the equality $z = x_1 + x_2$. This probability must be
summed over all the possible values of $X_2$. The convolution integral thus appears as a
further application of the fundamental laws (1.23, 1.24) for continuous variables. We
also note that Eq. (5.31) can be derived from the cumulative function (5.18). Indeed,
since:


$$P\{X_1 + X_2 \le z\} = F_Z(z) = \iint_{(x_1 + x_2 \le z)} p_{\boldsymbol{X}}(x_1, x_2)\, dx_1\, dx_2 = \int_{-\infty}^{+\infty} dx_2 \int_{-\infty}^{z - x_2} p_{\boldsymbol{X}}(x_1, x_2)\, dx_1 ,$$



Equation (5.31) can be obtained again by differentiating with respect to z and applying
Eq. (5.4).

In R there are many possible ways to perform convolution integrals. Our routine
ConvFun solves Exercise 5.1 and, with few modifications, also the other exercises
and in general many simple convolution problems. If we denote by fun1(x) and
fun2(x) two R or user functions which return a value for the input value
x, the lines of code of ConvFun that calculate the convolution integral between
fun1 and fun2 are:


> f.X <- function(x) fun1(x)
> f.Y <- function(x) fun2(x)
> # $value extracts from integrate the value of the integral
> f.Z <- function(z)
+   integrate(function(x,z) f.X(z-x)*f.Y(x),-Inf,+Inf,z)$value
> f.Z <- Vectorize(f.Z)
> # as an example,
> # the z vector has limits [-4,+4] in steps of 0.02
> z <- seq(-4,+4,0.02)
> plot(z,f.Z(z),type='l')

The first statements formally define the two functions f.X and f.Y, and then a
third function f.Z containing the routine integrate that performs the convolution.
The names f.X, f.Y and f.Z are ordinary R variables that hold functions.
The Vectorize statement is important because it wraps f.Z in a new function that
accepts a vector argument and applies f.Z to each of its elements. After this assignment,
if z is a vector, the same holds for f.Z(z), which allows it to be used as an
argument to plot in the next statement. The actual convolution computation occurs
within the plot call, when f.Z is evaluated on the vector z. The
R online manual contains additional useful information to understand these lines of
code.


Exercise 5.1 
Find the density of the random variable: 


$$Z = X + Y ,$$


where $X \sim N(\mu, \sigma^2)$ and $Y \sim U(a, b)$ are independent.

Answer From Eqs. (3.28, 3.79, 5.32), one has:


$$p_Z(z) = \int_{-\infty}^{+\infty} p_X(z - y)\, p_Y(y)\, dy = \frac{1}{b - a} \int_a^b \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\frac{[y - (z - \mu)]^2}{2\sigma^2} \right] dy . \tag{5.33}$$


This density is nothing more than a Gaussian with mean $(z - \mu)$ and variance
$\sigma^2$ integrated within the limits of the uniform density. Using the cumulative
Gaussian (3.43), one can rewrite it as:


$$p_Z(z) = \frac{1}{b - a} \left[ \Phi\left( \frac{b - (z - \mu)}{\sigma} \right) - \Phi\left( \frac{a - (z - \mu)}{\sigma} \right) \right] . \tag{5.34}$$


The instrument used to measure physical quantities is often associated with
a random uniform dispersion, while the measurement operations are usually
associated with a random Gaussian dispersion (see Chap. 12). In these cases,
Eq. (5.34) gives the total smearing of the measure and is therefore important
in the study of error propagation, which will be discussed in Sect. 12.9.

The R code lines needed to obtain Fig. 5.3 are:



> f.X <- function(x) dnorm(x)
> f.Y <- function(x) dunif(x,min=-2,max=+2)
> # $value extracts from integrate the value of the integral
> f.Z <- function(z)
+   integrate(function(x,z) f.X(z-x)*f.Y(x),-Inf,+Inf,z)$value
> f.Z <- Vectorize(f.Z)
> z <- seq(-4,+4,0.02)
> plot(z,f.Z(z),type='l',lwd=2)
> lines(z,f.X(z),type='l',lty=2)
> lines(z,f.Y(z),type='l',lty=4)


This figure shows that the shape of the resulting density is rather similar to a
Gaussian. An equivalent result is obtained with a call to our routine
ConvFun.
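
As a cross-check, the closed form (5.34) can be compared with the numerical convolution. This is only a sketch: p.Z is our own helper name, the parameter values are those of Fig. 5.3, and the integral is restricted to the support [a, b] of the uniform density (a choice of ours, which avoids integrating across its discontinuities):

```r
mu <- 0; sigma <- 1; a <- -2; b <- 2      # values used for Fig. 5.3
f.X <- function(x) dnorm(x, mean = mu, sd = sigma)
f.Y <- function(x) dunif(x, min = a, max = b)
# numerical convolution, Eq. (5.32), integrating over the support of f.Y
f.Z <- Vectorize(function(z)
  integrate(function(y) f.X(z - y) * f.Y(y), a, b)$value)
# closed form, Eq. (5.34), written with the cumulative Gaussian pnorm
p.Z <- function(z)
  (pnorm((b - (z - mu)) / sigma) - pnorm((a - (z - mu)) / sigma)) / (b - a)
z <- seq(-4, 4, 0.5)
print(max(abs(f.Z(z) - p.Z(z))))          # largest discrepancy on the grid
```

The printed discrepancy should be at the level of the numerical accuracy of integrate.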




Fig. 5.3 Convolution of a standard Gaussian having μ = 0 and σ = 1 (dashed line) with a
uniform density over [−2, 2] (dash-dotted line). The full curve is the resulting distribution


Exercise 5.2 
Find the density of the variable: 


$$Z = X + Y ,$$


where X ~ U(0, 1) and Y ~ U(0, 1) are independent. 


Answer Since $0 \le X, Y \le 1$, one has $0 \le Z \le 2$. Also in this case, from
Eqs. (3.79, 5.32) one immediately obtains:


$$p_Z(z) = \int u(z - x)\, u(x)\, dx ,$$


where u(x) is the uniform density:


$$u(x) = \begin{cases} 1 & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise} . \end{cases}$$


The uniform density arguments, which are the variables (z − x) and (x), must
therefore lie between 0 and 1. The integral is then composed of two terms:


$$p_Z(z) = \left[ \int_0^z dx \right]_{0 \le z \le 1} + \left[ \int_{z-1}^1 dx \right]_{1 < z \le 2} ,$$


and gives the result:


$$p_Z(z) = \begin{cases} z & \text{if } 0 \le z \le 1 \\ 2 - z & \text{if } 1 < z \le 2 \\ 0 & \text{otherwise} . \end{cases} \tag{5.35}$$


This density is normalized between 0 and 2, triangular and with a maximum
at z = 1. This function too will be extensively discussed during the study
of the error propagation of two measurements affected by instrumental errors,
which will be carried out in Sect. 12.9.
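
A quick numerical check of Eq. (5.35) is sketched below. The helper name p.Z, the seed and the test point z = 0.5 are our own choices; for z ≤ 1 the cumulative probability of the triangular density is $z^2/2$:

```r
# Eq. (5.35): triangular density of Z = X + Y, with X, Y ~ U(0, 1)
p.Z <- function(z)
  ifelse(z >= 0 & z <= 1, z, ifelse(z > 1 & z <= 2, 2 - z, 0))
area <- integrate(p.Z, 0, 2)$value       # normalization between 0 and 2
set.seed(1)                              # arbitrary seed
z.sim <- runif(1e5) + runif(1e5)         # simulated sums
p.sim <- mean(z.sim <= 0.5)              # simulated P{Z <= 0.5}
print(c(area = area, p.sim = p.sim, p.exact = 0.5^2 / 2))
```

A histogram of z.sim, for instance with hist(z.sim, freq = FALSE), shows the triangular shape directly.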


Exercise 5.3 
Find the density of the variable: 


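$$Z = X + Y ,$$

where $X \sim N(\mu_x, \sigma_x^2)$ and $Y \sim N(\mu_y, \sigma_y^2)$ are two independent Gaussian variables.


Answer Also in this case, from Eqs. (3.28, 5.32), one immediately obtains:


$$p_Z(z) = \frac{1}{2\pi \sigma_x \sigma_y} \int_{-\infty}^{+\infty} \exp\left[ -\frac{(x - \mu_x)^2}{2\sigma_x^2} - \frac{(z - x - \mu_y)^2}{2\sigma_y^2} \right] dx .$$


The integral appearing in this formula can be solved with the method
discussed in Exercise 3.4. It is of the type:


$$\int_{-\infty}^{+\infty} e^{-Ax^2 + 2Bx - C}\, dx = \sqrt{\frac{\pi}{A}}\; e^{-(AC - B^2)/A} ,$$


where:


$$A = \frac{\sigma_x^2 + \sigma_y^2}{2\sigma_x^2 \sigma_y^2} , \qquad B = \frac{\mu_x}{2\sigma_x^2} + \frac{z - \mu_y}{2\sigma_y^2} , \qquad C = \frac{\mu_x^2}{2\sigma_x^2} + \frac{(z - \mu_y)^2}{2\sigma_y^2} . \tag{5.36}$$


One then obtains the density:


$$p_Z(z) = \frac{1}{\sqrt{2\pi}\, \sqrt{\sigma_x^2 + \sigma_y^2}} \exp\left[ -\frac{[z - (\mu_x + \mu_y)]^2}{2(\sigma_x^2 + \sigma_y^2)} \right] , \tag{5.37}$$


which is a Gaussian with mean and standard deviation given by:


$$\mu_z = \mu_x + \mu_y , \qquad \sigma_z = \sqrt{\sigma_x^2 + \sigma_y^2} . \tag{5.38}$$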


Since the transformation is linear and the variables are independent, Eq. (5.38) 
is in agreement with Eqs. (4.8, 4.19). However, this exercise tells us a new 
and very important fact: the linear composition of Gaussian variables again 
generates Gaussian densities. 

Equation (5.37) can also be easily proved using the property (C.4) of the 
generating functions of Appendix C. 

When a density retains its functional form under linear composition of several
variables, it is said to be stable. Notice that according to the Central Limit
Theorem 3.1, the sum of N random variables tends to follow the Gaussian
distribution for N large enough, in practice for N > 10. If the starting
variables are already Gaussian, this condition can be removed and the property
holds for any N.

The set of these properties is the basis of the central role that the Gaussian 
or normal density assumes both in probability theory and in statistics. 
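
The result (5.37) can be verified numerically with the same integrate-based convolution used in the text. In this sketch the four parameter values are arbitrary assumptions chosen only for illustration:

```r
mu.x <- 1; s.x <- 2; mu.y <- -3; s.y <- 1.5   # illustrative values (assumptions)
# numerical convolution integral, Eq. (5.32), of the two Gaussian densities
f.Z <- Vectorize(function(z)
  integrate(function(x) dnorm(x, mu.x, s.x) * dnorm(z - x, mu.y, s.y),
            -Inf, Inf)$value)
z <- seq(-10, 6, 1)
# Eq. (5.37): Gaussian with the mean and standard deviation of Eq. (5.38)
p.Z <- dnorm(z, mean = mu.x + mu.y, sd = sqrt(s.x^2 + s.y^2))
print(max(abs(f.Z(z) - p.Z)))                 # largest discrepancy on the grid
```

The discrepancy reflects only the numerical accuracy of integrate, for any choice of the four parameters.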


Exercise 5.4 
Find the density of the variable:


$$Z = X + Y ,$$


where X and Y are two independent Poissonian variables, with means $\mu_1$ and
$\mu_2$, respectively.


Answer Equation (5.32), for integer variables $X, Y \ge 0$, must be rewritten as:


$$p_Z(z) = \sum_{x=0}^{z} p_X(x)\, p_Y(z - x) , \tag{5.39}$$


where both the densities $p_X$ and $p_Y$ are given by the Poisson distribution (3.14).
Therefore, one has:


$$p_Z(z) = \sum_{x=0}^{z} \frac{\mu_1^x\, \mu_2^{z-x}}{x!\, (z - x)!}\; e^{-(\mu_1 + \mu_2)} .$$


Multiplying and dividing by z! and remembering Newton's binomial formula:


$$(\mu_1 + \mu_2)^z = \sum_{k=0}^{z} \frac{z!}{k!\, (z - k)!}\; \mu_1^k\, \mu_2^{z-k} ,$$


one obtains:


$$p_Z(z) = \frac{e^{-(\mu_1 + \mu_2)}}{z!} \sum_{x=0}^{z} \frac{z!}{x!\, (z - x)!}\; \mu_1^x\, \mu_2^{z-x} = \frac{(\mu_1 + \mu_2)^z}{z!}\; e^{-(\mu_1 + \mu_2)} , \tag{5.40}$$


from which it results that the required density is a Poissonian with mean $(\mu_1 + \mu_2)$.

We can get the graph and the values of Poissonian convolutions again
using the call to our routine ConvFun(f.X,f.Y,cont=FALSE), which
can also deal with discrete distributions by applying Eq. (5.39).

We note that, unlike the Gaussian case, only the sum, but not the difference
of Poissonian variables, is Poissonian. In fact, if Z = Y − X, it is possible to
have Z < 0, and in Eq. (5.40) the term (z + x)! appears instead of (z − x)!.
It is clear then that the difference is not Poisson distributed. This
distribution can be studied by changing the sign in Eq. (5.39) or again with
the call ConvFun(f.X, f.Y, cont=FALSE, sign=FALSE).
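
Independently of ConvFun, the discrete convolution (5.39) takes only a few lines of base R. The sketch below uses two arbitrary means (our assumption) and also simulates the difference Y − X to show that it cannot be Poissonian:

```r
mu1 <- 2.5; mu2 <- 4                     # illustrative means (assumptions)
# Eq. (5.39): discrete convolution of the two Poisson distributions
p.Z <- function(z) sum(dpois(0:z, mu1) * dpois(z - (0:z), mu2))
z <- 0:20
conv <- sapply(z, p.Z)
print(max(abs(conv - dpois(z, mu1 + mu2))))   # comparison with Eq. (5.40)
# difference Z = Y - X: negative values occur and variance differs from mean
set.seed(1)                              # arbitrary seed
d <- rpois(1e5, mu2) - rpois(1e5, mu1)
print(c(mean = mean(d), var = var(d), frac.negative = mean(d < 0)))
```

For a Poisson variable the mean and the variance coincide, whereas for the difference the variances add while the means subtract.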


In the following two exercises, we will determine the Student's and Snedecor's
densities, which will be used later in statistics. So, don't skip the exercises (at least
read the statement and the solution), and pay attention.




Exercise 5.5 

Find the probability density of a variable Z defined as the ratio between a
standard Gaussian variable and the square root of a variable $Q_R$ following
the reduced $\chi^2$ density with $\nu$ degrees of freedom. The two variables are
mutually independent.


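Answer Let us denote by X the standard Gaussian variable with density (3.42)
and by Y the $\chi^2$ variable with density (3.67). We must then evaluate the
density of the variable:


$$Z = \frac{X}{\sqrt{Y/\nu}} = \sqrt{\nu}\, \frac{X}{\sqrt{Y}} . \tag{5.41}$$


Since X and Y are independent, we can apply Eq. (5.28), with:


$$x = f^{-1}(z, y) = \frac{z \sqrt{y}}{\sqrt{\nu}} , \qquad \frac{\partial f^{-1}}{\partial z} = \frac{\sqrt{y}}{\sqrt{\nu}} .$$


We then have, by using the product of the densities (3.42, 3.67):


$$p_Z(z) = \frac{1}{\sqrt{2\pi}\, 2^{\nu/2}\, \Gamma(\nu/2)\, \sqrt{\nu}} \int_0^{\infty} \sqrt{y}\; y^{\frac{\nu}{2} - 1} \exp\left[ -\frac{1}{2}\, y \left( \frac{z^2}{\nu} + 1 \right) \right] dy .$$


Now let us change the variable of integration as:


$$q = \frac{y}{2} \left( \frac{z^2}{\nu} + 1 \right) , \qquad dy = \frac{2}{z^2/\nu + 1}\, dq ,$$


which results in:


$$p_Z(z) = \frac{1}{\sqrt{2\pi}\, 2^{\nu/2}\, \Gamma(\nu/2)\, \sqrt{\nu}} \left( \frac{2}{z^2/\nu + 1} \right)^{\frac{\nu + 1}{2}} \int_0^{\infty} q^{\frac{\nu - 1}{2}}\, e^{-q}\, dq .$$


If we recall the definition (3.64) of the gamma function, a direct calculation
gives:


$$p_Z(z) = \frac{\Gamma\left( \frac{\nu + 1}{2} \right)}{\Gamma(\nu/2)\, \sqrt{\pi}\, \sqrt{\nu}} \left( \frac{z^2}{\nu} + 1 \right)^{-\frac{\nu + 1}{2}} .$$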

This is the well-known Student's density, which takes its name from the
pseudonym used by the English statistician W.S. Gosset, who derived it at
the beginning of the twentieth century. It is usually written as the density of
a variable t, where the identity $\sqrt{\pi} = \Gamma(1/2)$, shown in Eq. (3.65), is also
applied.

Therefore, the distribution of a variable t, defined as the ratio between a
standard Gaussian variable and the square root of a reduced $\chi^2$ variable with
$\nu$ degrees of freedom, is:


$$s_\nu(t) = \frac{\Gamma\left( \frac{\nu + 1}{2} \right)}{\Gamma(1/2)\, \Gamma(\nu/2)}\; \frac{1}{\sqrt{\nu}} \left( \frac{t^2}{\nu} + 1 \right)^{-\frac{\nu + 1}{2}} . \tag{5.42}$$


The integral values of the Student's density are shown in Table E.2 in
Appendix E. Both from this table and from Fig. 5.4, one can easily verify
that this density is very similar to a Gaussian when the number of degrees of
freedom is greater than 20-30. The values of the mean and variance can be
obtained, as usual, using Eqs. (2.54, 2.57), and are given by:


$$\mu = 0 , \qquad \sigma^2 = \frac{\nu}{\nu - 2} \quad \text{for } \nu > 2 . \tag{5.43}$$


The variance is then defined only for $\nu > 2$. For $\nu \le 2$ the function $s_\nu(t)$ is
an example, rather unusual but possible, of a density without variance. In this
case, the parametrization of the probability interval (3.94) in terms of standard
deviation is no longer possible, and to obtain a given probability level, it is
necessary to directly calculate the integral of the density within the assigned
limits. These probabilities can also be obtained from Tab. E.2.
In R the function t( ,df, ) computes the Student's distribution with
df degrees of freedom. The call sequences use standard R prefixes:

dt(x,df) # function value in x
pt(q,df) # cumulative value of the quantile q
qt(p,df) # quantile value of index p
rt(n,df) # vector of n random variates of t
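
As a check, Eq. (5.42) can be coded explicitly and compared with dt, and the defining ratio (5.41) can be simulated. In this sketch s.nu is a hypothetical helper name of ours, and the grid, the seed and the choice of 5 degrees of freedom for the simulation are arbitrary:

```r
# Eq. (5.42) written explicitly; s.nu is our own helper, not an R function
s.nu <- function(t, nu)
  gamma((nu + 1) / 2) / (gamma(1 / 2) * gamma(nu / 2) * sqrt(nu)) *
    (t^2 / nu + 1)^(-(nu + 1) / 2)
tv <- seq(-4, 4, 0.5)
for (nu in c(1, 5, 20))                   # degrees of freedom of Fig. 5.4
  print(max(abs(s.nu(tv, nu) - dt(tv, df = nu))))
# simulation of the ratio (5.41) with 5 degrees of freedom
set.seed(1)                               # arbitrary seed
nu <- 5
z <- rnorm(1e5) / sqrt(rchisq(1e5, df = nu) / nu)
print(c(sim = mean(z <= 1), exact = pt(1, df = nu)))
```

The simulated fraction agrees with pt only within the statistical error of the sample.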




Fig. 5.4 Student’s density for 1, 5 and 20 degrees of freedom. The dashed line represents the 
standard Gaussian 


Exercise 5.6 

Find the p.d.f. of a random variable F given by the ratio of two independent
random variables following the reduced $\chi^2$ distribution, with respective
degrees of freedom $\mu$ and $\nu$:


$$F = \frac{Q_R(\mu)}{Q_R(\nu)} . \tag{5.44}$$


According to the statistical practice, the ratio (5.44) should be written in
capital letters. We will then denote with F the values assumed by the
Snedecor's variable F.


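Answer We define $Y = Q_R(\mu)$ and $X = Q_R(\nu)$, so that $F = Y/X$ and:


$$y = f^{-1}(F, x) = F x , \qquad \frac{\partial f^{-1}}{\partial F} = x .$$


The reduced $\chi^2$ density with $\nu$ degrees of freedom is given by the integrand
of Eq. (3.72):


$$p_\nu(x) = c_\nu\, x^{\frac{1}{2}(\nu - 2)}\, e^{-\frac{1}{2} \nu x} ,$$


where:


$$c_\nu = \frac{\nu^{\nu/2}}{2^{\nu/2}\, \Gamma(\nu/2)} . \tag{5.45}$$


In this case, Eq. (5.28) becomes:


$$\begin{aligned} p_{\mu\nu}(F) &= \int_0^{\infty} p_\mu(F x)\, p_\nu(x) \left| \frac{\partial f^{-1}}{\partial F} \right| dx \\ &= c_\mu c_\nu \int_0^{\infty} (F x)^{\frac{1}{2}(\mu - 2)}\, e^{-\frac{1}{2} \mu F x}\; x^{\frac{1}{2}(\nu - 2)}\, e^{-\frac{1}{2} \nu x}\; x\, dx \\ &= c_\mu c_\nu\, F^{\frac{1}{2}(\mu - 2)} \int_0^{\infty} x^{\frac{1}{2}(\mu + \nu - 2)}\, e^{-\frac{1}{2}(\mu F + \nu) x}\, dx . \end{aligned}$$


After the change of variable:


$$\omega = \frac{1}{2}\, x\, (\mu F + \nu) , \qquad dx = \frac{2}{\mu F + \nu}\, d\omega ,$$


one finally obtains:


$$p_{\mu\nu}(F) = c_\mu c_\nu\, F^{\frac{1}{2}(\mu - 2)}\; \frac{2^{\frac{1}{2}(\mu + \nu)}}{(\mu F + \nu)^{\frac{1}{2}(\mu + \nu)}} \int_0^{\infty} \omega^{\frac{1}{2}(\mu + \nu - 2)}\, e^{-\omega}\, d\omega .$$


Recalling Eq. (5.45) and the integral form of the gamma function $\Gamma[(\mu + \nu)/2]$
of Eq. (3.65), we can write:


$$p_{\mu\nu}(F) = c_{\mu\nu}\, F^{\frac{1}{2}(\mu - 2)}\, (\mu F + \nu)^{-\frac{1}{2}(\mu + \nu)} , \tag{5.46}$$

$$c_{\mu\nu} = \mu^{\mu/2}\, \nu^{\nu/2}\; \frac{\Gamma\left( \frac{\mu + \nu}{2} \right)}{\Gamma(\mu/2)\, \Gamma(\nu/2)} .$$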


This result represents the well-known Snedecor's density F of the variable F.
It is displayed in Fig. 5.5 and is extensively used in the analysis of variance
(ANOVA) method, which will be discussed later in Sect. 7.9. Mean and
variance are as usual calculated from Eqs. (2.54, 2.57) and are given by:


$$\langle F \rangle = \frac{\nu}{\nu - 2} , \qquad \mathrm{Var}[F] = \frac{2 \nu^2 (\mu + \nu - 2)}{\mu (\nu - 2)^2 (\nu - 4)} . \tag{5.47}$$


These equations are valid for $\nu > 2$ and $\nu > 4$, respectively. When the degrees
of freedom $\mu, \nu \to \infty$, the F density tends to a Gaussian distribution;
however, as the figure shows, the convergence toward this function is rather
slow.

The values of the ratio $F = F_u(\mu, \nu)$ corresponding to the 95th percentile
(u = 0.95) and to the 99th percentile (u = 0.99), and therefore to the significance
levels of 5% and 1% to the right of the mean, defined by:


$$u = \int_0^{F_u(\mu, \nu)} p_{\mu\nu}(F)\, dF , \tag{5.48}$$


are the most frequently used in ANOVA. They are reported in Tabs. E.5
and E.6 of Appendix E. The percentiles (u = 5%) and (u = 1%), corresponding
to the significance levels of the tails to the left of the mean, are generally
not given, because of the following crossing property between quantiles:


$$F_u(\mu, \nu) = \frac{1}{F_{1-u}(\nu, \mu)} . \tag{5.49}$$


This equation can be proved by observing that, by definition:


$$1 - u = \int_0^{F_{1-u}(\mu, \nu)} p_{\mu\nu}(F)\, dF = \int_{F_u(\mu, \nu)}^{\infty} p_{\mu\nu}(F)\, dF ,$$


and that, since F is a ratio:


$$\text{if } F \sim p_{\mu\nu}(F) \text{ then } \frac{1}{F} \sim p_{\nu\mu}(F) .$$


Therefore, one can write:


$$u = P\{F \le F_u(\mu, \nu)\} = P\left\{ \frac{1}{F} \ge \frac{1}{F_u(\mu, \nu)} \right\} = P\left\{ \frac{1}{F} \ge F_{1-u}(\nu, \mu) \right\} = P\left\{ F \le \frac{1}{F_{1-u}(\nu, \mu)} \right\} ,$$


and Eq. (5.49) follows.
In R, the function f( , df1, df2, ) evaluates the F density with
df1 and df2 degrees of freedom, with the calling sequences:

df(x,df1,df2) # function value in x
pf(q,df1,df2) # cumulative value of the quantile q
qf(p,df1,df2) # quantile value of index p
rf(n,df1,df2) # vector of n random variates of F


To numerically verify Eq. (5.49), it is enough to check the equality
of quantile values such as qf(0.3,df1=3,df2=4) and 1/qf(0.7,
df1=4,df2=3); in both cases the obtained value is 0.5038967.
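
The same kind of check can be extended to the density (5.46) and to several probability levels. The sketch below is illustrative only: p.F is a hypothetical helper name of ours, and the degrees of freedom and the test points are arbitrary:

```r
# Eq. (5.46) written explicitly; p.F is our own helper, not an R function
p.F <- function(x, mu, nu) {
  c.munu <- mu^(mu / 2) * nu^(nu / 2) * gamma((mu + nu) / 2) /
    (gamma(mu / 2) * gamma(nu / 2))
  c.munu * x^((mu - 2) / 2) * (mu * x + nu)^(-(mu + nu) / 2)
}
Fv <- c(0.2, 0.5, 1, 2, 5)
print(max(abs(p.F(Fv, 3, 4) - df(Fv, df1 = 3, df2 = 4))))
# crossing property (5.49) for several probability levels u
u <- c(0.01, 0.05, 0.3, 0.7, 0.95)
print(cbind(u, lhs = qf(u, 3, 4), rhs = 1 / qf(1 - u, 4, 3)))
```

The two columns lhs and rhs should coincide for every u, which is why tables usually report only the right-tail quantiles.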


Fig. 5.5 Distribution of the Snedecor's F density with respective degrees of freedom μ and ν.
Full curve: μ, ν = 2; dashed curve: μ, ν = 5; dash-dotted curve: μ, ν = 10




One could write some books about functions of random variables (and, as a matter 
of fact, they have been written ...). If you like mathematical analysis and wish to 
learn more about this subject, you can refer to the classic text of Papoulis [PUP02]. 


5.4 Mean and Variance Transformation 


As you have seen, the determination of the probability density of a variable 
expressed as a function of other random variables is a rather complex subject. In 
the examples developed so far, we have already met some non-trivial mathematical 
complications, even though we have limited ourselves to the simple case of only 
two independent variables. 

Fortunately, as in the calculation of elementary probabilities, an approximate but 
satisfactory solution of the problem can almost always be obtained by determining, 
instead of the complete functional forms, only the central (mean) value and the 
dispersion (standard deviation) of the density functions under study. This is what
we now intend to develop.

Let us start with the case of a single variable Z which is a function of a single
random variable X following the known density $p_X(x)$, i.e.:


Z = f(X). (5.50) 


Assuming f to be invertible, $X = f^{-1}(Z)$, the average of Z is obtained from
Eqs. (2.54), (5.7) and differentiating Eq. (5.50):


$$\langle Z \rangle = \int z\, p_Z(z)\, dz = \int f(x)\, \frac{p_X(x)}{|f'(x)|}\, dz = \int f(x)\, p_X(x)\, dx . \tag{5.51}$$


It then turns out that the mean of Z is given by the mean of f(X) with respect
to the density $p_X(x)$, according to Eq. (2.68), which is the definition of expected
value. This result, also valid in a multidimensional space under the conditions of
Theorem 5.1, allows one to obtain the central value of the density of Z in a correct and
quite simple way. In many cases, however, an approximate formula is used, which
is obtained by the second-order Taylor expansion of the function f(x) about $\mu$, the
mean of the original variable X:


$$f(x) \simeq f(\mu) + f'(\mu)(x - \mu) + \frac{1}{2} f''(\mu)(x - \mu)^2 . \tag{5.52}$$




By inserting this expansion into Eq. (5.51), it is easy to verify that the term
containing the first derivative vanishes and that, therefore, the approximate relation:


$$\langle Z \rangle \simeq f(\mu) + \frac{1}{2} f''(\mu)\, \sigma^2 \tag{5.53}$$


holds, where $\sigma$ is the standard deviation of X.

This important equation shows that the mean of the function f (i.e. of Z) is equal
to the function of the mean plus a corrective term that depends on the concavity of
the function around the mean of X. If Eq. (5.50) is linear,


$$Z = aX + b ,$$

then the second derivative in Eq. (5.53) vanishes and one obtains:

$$\langle Z \rangle = f(\langle X \rangle) . \tag{5.54}$$

In the non-linear case, it is possible to show that $\langle Z \rangle > f(\langle X \rangle)$ if $f''(\langle X \rangle) > 0$,
whereas $\langle Z \rangle < f(\langle X \rangle)$ if $f''(\langle X \rangle) < 0$.


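We now come to the transformation of the variance. As in Eq. (5.51), one can 
write: 

$$ \mathrm{Var}[Z] = \int \left[ f(x) - \langle f(X) \rangle \right]^2 p_X(x)\, dx . \tag{5.55} $$

By recalling the approximate result of Eq. (5.53) and using the second-order 
expansion of Eq. (5.52), one obtains: 

$$ \begin{aligned}
\mathrm{Var}[Z] &\simeq \int \left[ f(x) - f(\mu) - \frac{1}{2} f''(\mu)\,\sigma^2 \right]^2 p_X(x)\, dx \\
&= \int \left[ f'(\mu)(x - \mu) + \frac{1}{2} f''(\mu)(x - \mu)^2 - \frac{1}{2} f''(\mu)\,\sigma^2 \right]^2 p_X(x)\, dx .
\end{aligned} $$

Carrying out the square in the integrand and remembering definition (2.59) of the 
moments of a distribution, after a somewhat long but easy reworking, one obtains: 

$$ \mathrm{Var}[Z] \simeq [f'(\mu)]^2 \sigma^2 + \frac{1}{4} [f''(\mu)]^2 (\Delta_4 - \sigma^4) + f'(\mu)\, f''(\mu)\, \Delta_3 , \tag{5.56} $$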


where $\Delta_i$ are the moments defined in Eq. (2.59). Since this is still an approximate 
relation, to have an acceptable estimate of the variance of $Z$, it is often sufficient to 
know only the order of magnitude of the moments. If the density of $X$ is symmetrical 
around the mean, we have $\Delta_3 = 0$. If, in addition to being symmetric, the density is 
also Gaussian, then, based on Eq. (3.33), $\Delta_4 = 3\sigma^4$ and Eq. (5.56) becomes: 

$$ \mathrm{Var}[Z] \simeq [f'(\mu)]^2 \sigma^2 + \frac{1}{2} [f''(\mu)]^2 \sigma^4 . \tag{5.57} $$




If the density $p_X(x)$ is symmetric but not Gaussian, using Eq. (5.57) usually does 
not introduce large errors. 

When the standard deviation of $p_X(x)$ is small, that is, when $\sigma^2 \gg \sigma^4$, the 
additional approximation 

$$ \mathrm{Var}[Z] \simeq [f'(\mu)]^2 \sigma^2 \tag{5.58} $$

holds. This equation becomes exact when the relation between $Z$ and $X$ is 
linear. In this last case, Eqs. (5.54, 5.58) coincide with Eqs. (2.63, 2.64). 

It is also useful to verify that, when $Z = aX^2$ and $X$ follows the Gaussian density, 
Eqs. (5.53, 5.57) give the correct result (5.16). 
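
As a quick numerical check of Eqs. (5.53), (5.57) and (5.58), the following R sketch compares them with a numerical integration of Eqs. (5.51) and (5.55) for $Z = aX^2$ with Gaussian $X$. The values of `mu`, `sigma` and `a`, as well as the integration range of ten standard deviations, are illustrative assumptions and are not taken from the text.

```r
# Illustrative values (assumed, not from the text)
mu <- 2; sigma <- 0.5; a <- 3

f  <- function(x) a * x^2     # Z = f(X)
d1 <- 2 * a * mu              # f'(mu)
d2 <- 2 * a                   # f''(mu)

# Approximate formulas
mean_553 <- f(mu) + 0.5 * d2 * sigma^2             # Eq. (5.53)
var_557  <- d1^2 * sigma^2 + 0.5 * d2^2 * sigma^4  # Eq. (5.57), Gaussian X
var_558  <- d1^2 * sigma^2                         # Eq. (5.58), first derivative only

# Reference values from numerical integration of Eqs. (5.51) and (5.55)
lo <- mu - 10 * sigma; hi <- mu + 10 * sigma
mean_num <- integrate(function(x) f(x) * dnorm(x, mu, sigma), lo, hi)$value
var_num  <- integrate(function(x) (f(x) - mean_num)^2 * dnorm(x, mu, sigma),
                      lo, hi)$value

print(c(mean_553 = mean_553, mean_num = mean_num))
print(c(var_557 = var_557, var_558 = var_558, var_num = var_num))
```

For this quadratic function the second-order formulas agree with the numerical integrals, whereas Eq. (5.58) misses the term proportional to $\sigma^4$.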

Let us now deal with the more general situation, consisting of one variable $Z$ 
which is a function of $n$ variables $X_i$: 

$$ Z = f(X_1, X_2, \dots, X_n) = f(\boldsymbol{X}) . \tag{5.59} $$


This case can be easily handled if one uses the linear approximation of Eqs. (5.54, 
5.58). In other words, we assume that the mean of the function f coincides with the 
function of the means and that the variance of Z depends only on the first derivatives 
of f and on the variances of the distributions of X. 

We begin with the simplest situation of two random variables: 

$$ Z = f(X_1, X_2) . $$

If the function $f$ is linearized around the means $\mu_1$, $\mu_2$ of the two original variables 
$X_1$, $X_2$, one obtains: 

$$ \begin{aligned}
Z &\simeq f(\mu_1, \mu_2) + \frac{\partial f}{\partial x_1}(X_1 - \mu_1) + \frac{\partial f}{\partial x_2}(X_2 - \mu_2) \\
&= f(\mu_1, \mu_2) + \frac{\partial f}{\partial x_1}\,\Delta X_1 + \frac{\partial f}{\partial x_2}\,\Delta X_2 ,
\end{aligned} \tag{5.60} $$

where the derivatives are calculated at $x_1 = \mu_1$, $x_2 = \mu_2$. 
The mean of $Z$ is obtained by extending Eq. (5.51) to two variables: 

$$ \langle Z \rangle = \int f(x_1, x_2)\, p_X(x_1, x_2)\, dx_1\, dx_2 , \tag{5.61} $$

and by substituting the expansion (5.60) for $f(x_1, x_2)$. Since the terms of the type 
$(x_i - \mu_i)\, p_X(x_1, x_2)$ are cancelled by the integration, the final result is simply, as 
always in linear approximation, that the mean of the function coincides with the 
function of the means: 

$$ \langle Z \rangle = f(\mu_1, \mu_2) . \tag{5.62} $$




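The generalization to the case of $n$ variables obviously gives: 

$$ \langle Z \rangle = f(\mu_1, \mu_2, \dots, \mu_n) . \tag{5.63} $$

The variance of $Z$ is evaluated by considering the generalization of Eq. (5.55): 

$$ \mathrm{Var}[Z] \simeq \int \left[ f(x_1, x_2) - f(\mu_1, \mu_2) \right]^2 p_X(x_1, x_2)\, dx_1\, dx_2 , \tag{5.64} $$

after substituting $f(x_1, x_2) - f(\mu_1, \mu_2)$ with the expansion (5.60). We then obtain: 

$$ \begin{aligned}
\mathrm{Var}[Z] &\simeq \left[ \frac{\partial f}{\partial x_1} \right]^2 \int (\Delta x_1)^2\, p_X(x_1, x_2)\, dx_1\, dx_2
 + \left[ \frac{\partial f}{\partial x_2} \right]^2 \int (\Delta x_2)^2\, p_X(x_1, x_2)\, dx_1\, dx_2 \\
&\quad + 2 \left[ \frac{\partial f}{\partial x_1} \frac{\partial f}{\partial x_2} \right] \int \Delta x_1\, \Delta x_2\, p_X(x_1, x_2)\, dx_1\, dx_2 \\
&= \left[ \frac{\partial f}{\partial x_1} \right]^2 \sigma_1^2 + \left[ \frac{\partial f}{\partial x_2} \right]^2 \sigma_2^2
 + 2 \left[ \frac{\partial f}{\partial x_1} \frac{\partial f}{\partial x_2} \right] \sigma_{12} .
\end{aligned} \tag{5.65} $$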


This result contains some interesting new features: in linear approximation, the 
variance of $Z$ is a function of the variances $\sigma_i^2$ of the single variables, of their 
covariance $\sigma_{12}$ and of the derivatives evaluated at the point $x_1 = \mu_1$, $x_2 = \mu_2$. This law 
generalizes Eq. (4.19). 

If the two variables are independent, they have zero covariance and one obtains: 

$$ \sigma_{12} = 0 \implies \sigma_z^2 \simeq \left[ \frac{\partial f}{\partial x_1} \right]^2 \sigma_1^2 + \left[ \frac{\partial f}{\partial x_2} \right]^2 \sigma_2^2 . \tag{5.66} $$


When $Z = X_1 + X_2$ is given by the sum of two independent variables, the resulting 
density is given by the convolution integral (5.32), and the explicit calculation of $\mu_z$ 
and $\sigma_z^2$ could be rather cumbersome, depending on the complexity of the densities 
involved. However, in this case Eqs. (5.62) and (5.66) are exact and give the result: 

$$ \mu_z = \mu_1 + \mu_2 , \qquad \sigma_z^2 = \sigma_1^2 + \sigma_2^2 . \tag{5.67} $$

Therefore, the mean and variance of $Z$ are known exactly, even if the explicit form 
of the final density remains unknown or is too complicated to calculate. More generally, 
the linear approximation allows one to evaluate, in an approximate but simple way, the 
dispersion of the $z$ values around their mean, by using the criteria of the probability 
intervals and the $3\sigma$ law described in Sects. 3.5 and 3.10. In general, this turns out 
to be an appropriate procedure to solve the problem. Equations (5.62, 5.66) usually 




give good (although approximate!) results also when $Z$ is the product or the ratio 
of independent variables. In this case, Eq. (5.66) gives, both for the product and the 
ratio, the result: 

$$ Z = X_1 X_2 ,\; Z = \frac{X_1}{X_2} ,\; Z = \frac{X_2}{X_1} \implies \frac{\mathrm{Var}[Z]}{\langle Z \rangle^2} \simeq \frac{\mathrm{Var}[X_1]}{\langle X_1 \rangle^2} + \frac{\mathrm{Var}[X_2]}{\langle X_2 \rangle^2} , \tag{5.68} $$

which shows that the relative variance of $Z$ (sometimes called the square of the 
coefficient of variation, $CV = \sigma/\mu$) is the sum of the relative variances of the 
input variables. However, this is an approximate result, as shown by the following 
exercise. 


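Exercise 5.7 

Find the exact formula for the variance of the product $XY$ of two independent 
random variables. 

Answer If the two variables are independent, we know, from Eq. (4.9), that 
the mean of a product is the product of the means. The variance of a product 
is then given by: 

$$ \begin{aligned}
\mathrm{Var}[XY] &= \int (xy - \mu_x \mu_y)^2\, p_X(x)\, p_Y(y)\, dx\, dy \\
&= \int (x^2 y^2 + \mu_x^2 \mu_y^2 - 2 \mu_x \mu_y\, x y)\, p_X(x)\, p_Y(y)\, dx\, dy \\
&= \langle X^2 \rangle \langle Y^2 \rangle + \mu_x^2 \mu_y^2 - 2 \mu_x^2 \mu_y^2 = \langle X^2 \rangle \langle Y^2 \rangle - \mu_x^2 \mu_y^2 .
\end{aligned} $$

Recalling Eq. (2.67), we can write: 

$$ \begin{aligned}
\mathrm{Var}[XY] &= (\sigma_x^2 + \mu_x^2)(\sigma_y^2 + \mu_y^2) - \mu_x^2 \mu_y^2 \\
&= \sigma_x^2 \sigma_y^2 + \mu_x^2 \sigma_y^2 + \mu_y^2 \sigma_x^2 ,
\end{aligned} \tag{5.69} $$

which is the required solution. 

We can compare this equation with Eq. (5.68) if we divide both sides by 
the product of the squared means: 

$$ \frac{\mathrm{Var}[XY]}{\mu_x^2 \mu_y^2} = \frac{\sigma_x^2}{\mu_x^2} + \frac{\sigma_y^2}{\mu_y^2} + \frac{\sigma_x^2 \sigma_y^2}{\mu_x^2 \mu_y^2} . \tag{5.70} $$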


This last result shows that Eq. (5.68) holds only if the condition: 

$$ \frac{\sigma_x^2}{\mu_x^2} + \frac{\sigma_y^2}{\mu_y^2} \gg \frac{\sigma_x^2 \sigma_y^2}{\mu_x^2 \mu_y^2} \tag{5.71} $$

is verified. This happens when the relative variances are small. 
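
The size of the term neglected by the linear approximation can be seen with a small R sketch that compares the exact variance of a product, Eq. (5.69), the linear result of Eq. (5.68) and a Monte Carlo estimate. The means, the standard deviations, the choice of Gaussian inputs and the seed are assumptions made only for illustration.

```r
set.seed(1)                      # arbitrary seed, for reproducibility
mx <- 10; sx <- 1                # assumed mean and standard deviation of X
my <- 5;  sy <- 2                # assumed mean and standard deviation of Y

var_exact  <- sx^2 * sy^2 + mx^2 * sy^2 + my^2 * sx^2  # Eq. (5.69)
var_linear <- mx^2 * sy^2 + my^2 * sx^2                # Eq. (5.68) times (mx * my)^2

# Monte Carlo cross-check with independent Gaussian samples
x <- rnorm(1e6, mx, sx)
y <- rnorm(1e6, my, sy)
var_sim <- var(x * y)

print(c(exact = var_exact, linear = var_linear, simulated = var_sim))
```

The difference between `var_exact` and `var_linear` is the product of the two variances, which is the term that condition (5.71) requires to be negligible.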


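Let us now return to Eq. (5.65), and note that it can be expressed in the matrix form: 

$$ \sigma_z^2 \simeq \begin{pmatrix} \partial f/\partial x_1 & \partial f/\partial x_2 \end{pmatrix}
\begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}
\begin{pmatrix} \partial f/\partial x_1 \\ \partial f/\partial x_2 \end{pmatrix} \equiv T\, V\, T^\dagger , \tag{5.72} $$

where $V$ is the symmetric covariance matrix (4.62) written for the bidimensional 
case; $T$ is the derivative matrix, also named gradient or transport matrix; and $\dagger$ 
indicates matrix transposition. 

This equation can be interpreted by stating that the variances and covariances 
of the initial variables are transformed or “transported”, by the matrices of the 
derivatives, through the function $f$, to obtain the dispersion of the variable $Z$. 
Equation (5.72) can be immediately extended to the $n$-dimensional case of Eq. (5.59): 

$$ \sigma_z^2 \simeq \begin{pmatrix} \partial f/\partial x_1 & \partial f/\partial x_2 & \dots & \partial f/\partial x_n \end{pmatrix}
\begin{pmatrix} \sigma_1^2 & \sigma_{12} & \dots & \sigma_{1n} \\ \sigma_{21} & \sigma_2^2 & \dots & \sigma_{2n} \\ \vdots & \vdots & & \vdots \\ \sigma_{n1} & \sigma_{n2} & \dots & \sigma_n^2 \end{pmatrix}
\begin{pmatrix} \partial f/\partial x_1 \\ \partial f/\partial x_2 \\ \vdots \\ \partial f/\partial x_n \end{pmatrix} \equiv T\, V\, T^\dagger . \tag{5.73} $$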


If all the variables are independent, the covariances are zero, and the previous 
equation directly gives the generalization of Eq. (5.66): 

$$ \sigma_z^2 \simeq \sum_{i=1}^{n} \left[ \frac{\partial f}{\partial x_i} \right]^2 \sigma_i^2 , \tag{5.74} $$

where the derivatives are calculated, as usual, at the mean values of $X_i$. 
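
The matrix form (5.73) translates almost literally into R. In the sketch below the helper names `num_grad` and `propagate_var` are hypothetical (they are not functions of the book or of any package), the derivatives are approximated by central finite differences, and the means and covariance matrix are assumed values chosen for illustration.

```r
# Hypothetical helper: central-difference gradient of f at the point mu
num_grad <- function(f, mu, h = 1e-6) {
  sapply(seq_along(mu), function(i) {
    step <- replace(numeric(length(mu)), i, h)   # h in position i, 0 elsewhere
    (f(mu + step) - f(mu - step)) / (2 * h)
  })
}

# Hypothetical helper: linear propagation of the variance, Eq. (5.73)
propagate_var <- function(f, mu, V) {
  g <- num_grad(f, mu)             # derivatives at the means (the matrix T)
  drop(t(g) %*% V %*% g)           # T V T^dagger as a plain number
}

# Example with assumed values: Z = X1 * X2 with correlated inputs
mu <- c(10, 5)
V  <- matrix(c(0.04, 0.01,
               0.01, 0.09), byrow = TRUE, ncol = 2)
var_z <- propagate_var(function(x) x[1] * x[2], mu, V)
print(var_z)
```

With a diagonal `V` the same call reduces to the sum of Eq. (5.74).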




Recall that Eqs. (5.60-5.74) are exact only for linear transformations of the 
type $Z = b + \sum_i a_i X_i$ with $a_i$ and $b$ constant coefficients. However, they provide 
fairly accurate results even in non-linear cases, when the densities involved are fairly 
symmetrical, the number of initial variables is large and their relative variances are 
small. 

Fortunately, for complicated cases there is a method, based on simulation 
techniques, which allows the calculation of the variances of any multivariate 
function in a very simple and effective way. It will be described in detail in Sect. 8.9. 


5.5 Means and Variances for n Variables 


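Let us now deal with the more general case of $m$ variables $Z_i$ that are functions of 
$n$ variables $X_j$: 

$$ \begin{aligned}
Z_1 &= f_1(X_1, X_2, \dots, X_n) \\
Z_2 &= f_2(X_1, X_2, \dots, X_n) \\
&\;\;\vdots \\
Z_m &= f_m(X_1, X_2, \dots, X_n) .
\end{aligned} \tag{5.75} $$

If we remain within the linear approximation, the $m$ means of the $Z$ variables are 
obviously given by: 

$$ \begin{aligned}
\langle Z_1 \rangle &\simeq f_1(\mu_1, \mu_2, \dots, \mu_n) \\
\langle Z_2 \rangle &\simeq f_2(\mu_1, \mu_2, \dots, \mu_n) \\
&\;\;\vdots \\
\langle Z_m \rangle &\simeq f_m(\mu_1, \mu_2, \dots, \mu_n) .
\end{aligned} \tag{5.76} $$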


To determine the variances (and the covariances!) of the variables $Z_i$, we need to 
generalize Eq. (5.73). The procedure does not present conceptual difficulties: it is 
necessary to start from the covariance matrix (4.62) of the variables $X$, to perform 
the row-by-column product with the transport matrices $T$ and $T^\dagger$ and to obtain the 
covariance matrix of the variables $Z$: 

$$ V(Z) \simeq T\, V(X)\, T^\dagger , \qquad \mathrm{Cov}[Z_i, Z_k] \simeq \sum_{j=1}^{n} \sum_{l=1}^{n} T_{ij}\, \sigma_{jl}\, T^\dagger_{lk} . \tag{5.77} $$




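The covariances $V(X)$ and $V(Z)$ are square and symmetric matrices with dimension 
$n \times n$ and $m \times m$, respectively, whereas $T$ is an $m \times n$ transport matrix given by: 

$$ T = \begin{pmatrix}
\partial f_1/\partial x_1 & \partial f_1/\partial x_2 & \dots & \partial f_1/\partial x_n \\
\partial f_2/\partial x_1 & \partial f_2/\partial x_2 & \dots & \partial f_2/\partial x_n \\
\vdots & \vdots & & \vdots \\
\partial f_m/\partial x_1 & \partial f_m/\partial x_2 & \dots & \partial f_m/\partial x_n
\end{pmatrix} , \tag{5.78} $$

and $T^\dagger$ is the $n \times m$ transposed matrix of $T$. We recall that, from Definition 4.3, 
$\mathrm{Cov}[X_i, X_i] = \mathrm{Var}[X_i] = \sigma_{ii}$, $\mathrm{Cov}[X_i, X_j] = \sigma_{ij}$. If we introduce the compact 
notation: 

$$ \frac{\partial f_i}{\partial x_k} \equiv (\partial_k f_i) , \tag{5.79} $$

we can write Eq. (5.77) as: 

$$ \mathrm{Cov}[Z_i, Z_k] \simeq \sum_{j,l=1}^{n} (\partial_j f_i)\, \sigma_{jl}\, (\partial_l f_k) . \tag{5.80} $$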


This equation allows one to calculate, in a fairly simple way, the dispersions of the 
variables $Z$ and their possible covariances and correlations. For example, when the 
input variables are independent, all covariances are zero, that is, $\sigma_{ij} = 0$ for $i \neq j$, 
and Eq. (5.80) becomes: 

$$ \mathrm{Cov}[Z_i, Z_k] \simeq \sum_{j=1}^{n} (\partial_j f_i)\, \sigma_{jj}\, (\partial_j f_k) . \tag{5.81} $$


Since it is often necessary to calculate covariances of functions of random variables, 
we want to show in detail how to do it. We will describe the case of two variables 
$Z_1$, $Z_2$ and $X_1$, $X_2$, because the generalization to functions containing a greater 
number of variables is obvious. 

The problem consists, once the variances and covariances of $X_1$, $X_2$ have been 
obtained from the data, in determining the variances and covariances of the variables 
$Z_1$, $Z_2$. As a matter of fact, everything is implicitly contained in Eq. (5.80). The 
covariance $\mathrm{Cov}[Z_1, Z_2]$ is then given by: 


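$$ \begin{aligned}
\mathrm{Cov}[Z_1, Z_2] &\simeq \sum_{j,l=1}^{2} (\partial_j f_1)\, \sigma_{jl}\, (\partial_l f_2) \\
&= (\partial_1 f_1)\, \sigma_1^2\, (\partial_1 f_2) + (\partial_1 f_1)\, \sigma_{12}\, (\partial_2 f_2)
 + (\partial_2 f_1)\, \sigma_{21}\, (\partial_1 f_2) + (\partial_2 f_1)\, \sigma_2^2\, (\partial_2 f_2) \\
&= (\partial_1 f_1)(\partial_1 f_2)\, \sigma_1^2 + (\partial_2 f_1)(\partial_2 f_2)\, \sigma_2^2
 + \left[ (\partial_1 f_1)(\partial_2 f_2) + (\partial_2 f_1)(\partial_1 f_2) \right] \sigma_{12} ,
\end{aligned} \tag{5.82} $$

where the equality $\sigma_{12} = \sigma_{21}$ has been used in the last row. 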

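Matrix notation is convenient and compact, but we remind you that Eq. (5.82) can 
also be directly demonstrated, from the covariance definition (4.24), by expanding 
the $Z$ variables around their means as Taylor series up to the first order. In this way, 
one has: 

$$ \begin{aligned}
\mathrm{Cov}[Z_1, Z_2] &= \left\langle (Z_1 - \langle Z_1 \rangle)(Z_2 - \langle Z_2 \rangle) \right\rangle \\
&\simeq \left\langle \left[ (\partial_1 f_1) \Delta X_1 + (\partial_2 f_1) \Delta X_2 \right] \left[ (\partial_1 f_2) \Delta X_1 + (\partial_2 f_2) \Delta X_2 \right] \right\rangle .
\end{aligned} \tag{5.83} $$

Going on with the calculation and taking into account that: 

$$ \langle (\Delta X_1)^2 \rangle = \sigma_1^2 , \qquad \langle (\Delta X_2)^2 \rangle = \sigma_2^2 , \qquad \langle (\Delta X_1)(\Delta X_2) \rangle = \sigma_{12} = \sigma_{21} , $$

Eq. (5.82) is again obtained. 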


Exercise 5.8 
Two random variables $Z_1$ and $Z_2$ depend on two standard independent 
Gaussian variables $X$ and $Y$ according to the functions: 

$$ Z_1 = X + 3Y , \qquad Z_2 = 5X + Y . $$

Find the linear correlation coefficient between $Z_1$ and $Z_2$. 

Answer Even if $X$ and $Y$ are independent, the functional link creates a 
dependence between $Z_1$ and $Z_2$. 


Defining $X_1 = X$ and $X_2 = Y$ and using the notation of Eq. (5.79), one 
easily finds: 

$$ (\partial_1 f_1) = 1 , \quad (\partial_2 f_1) = 3 , \quad (\partial_1 f_2) = 5 , \quad (\partial_2 f_2) = 1 . $$

To determine the linear correlation coefficient of Eq. (4.31), it is first 
necessary to find the variances and the covariance of $Z_1$ and $Z_2$. 

Since the input variables are independent, to evaluate the variances of $Z$, 
we just need to apply Eq. (5.81) and keep in mind that the standard variables 
have unit variance: 

$$ \mathrm{Var}[Z_1] = (1)^2 \sigma_1^2 + (3)^2 \sigma_2^2 = 10 , \qquad \mathrm{Var}[Z_2] = (5)^2 \sigma_1^2 + (1)^2 \sigma_2^2 = 26 . $$

Since $X$ and $Y$ are independent standard random variables, the covariance 
between the $Z$ variables is evaluated through Eq. (5.82), with $\sigma_1^2 = \sigma_2^2 = 1$ 
and $\sigma_{12} = 0$: 

$$ \mathrm{Cov}[Z_1, Z_2] = (5 \cdot 1)\, \sigma_1^2 + (3 \cdot 1)\, \sigma_2^2 + (1 \cdot 1 + 3 \cdot 5)\, \sigma_{12} = 5 + 3 = 8 . $$

From Eq. (4.31), one finally obtains: 

$$ \rho[Z_1, Z_2] = \frac{8}{\sqrt{10}\, \sqrt{26}} = 0.496 . $$
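
The same numbers follow from the matrix product of Eq. (5.77). The R sketch below is one possible way to write it; the variable names are arbitrary, and `cov2cor()` is the base R function that converts a covariance matrix into a correlation matrix.

```r
# Transport matrix of Exercise 5.8: rows are Z1 = X + 3Y and Z2 = 5X + Y
Tm <- matrix(c(1, 3,
               5, 1), byrow = TRUE, ncol = 2)
Vx <- diag(2)                   # X and Y standard and independent

Vz  <- Tm %*% Vx %*% t(Tm)      # Eq. (5.77): covariance matrix of (Z1, Z2)
rho <- cov2cor(Vz)[1, 2]        # linear correlation coefficient

print(Vz)
print(rho)
```

Since the transformation is linear, the result is exact and does not depend on the Gaussian form of $X$ and $Y$.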


Exercise 5.9 
Two random variables $X$ and $Y$ have known mean, variance and covariance: 
$\mu_x$, $\mu_y$, $\sigma_x^2$, $\sigma_y^2$, $\sigma_{xy}$. The transformation: 

$$ Z_1 = 5X + Y , \qquad Z_2 = XY $$

is applied to them. Find the covariance $\mathrm{Cov}[Z_1, Z_2]$ between $Z_1$ and $Z_2$. 



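Answer Let $\mu_1$ and $\mu_2$ be the means of $Z_1$ and $Z_2$, respectively. Using 
Eq. (5.83) and a series expansion of the two new variables as a function 
of the old ones, we obtain: 

$$ \begin{aligned}
\mathrm{Cov}[Z_1, Z_2] &= \left\langle (Z_1 - \mu_1)(Z_2 - \mu_2) \right\rangle \\
&\simeq \left\langle \left( \frac{\partial Z_1}{\partial X} \Delta X + \frac{\partial Z_1}{\partial Y} \Delta Y \right) \left( \frac{\partial Z_2}{\partial X} \Delta X + \frac{\partial Z_2}{\partial Y} \Delta Y \right) \right\rangle \\
&= \frac{\partial Z_1}{\partial X} \frac{\partial Z_2}{\partial X} \left\langle (\Delta X)^2 \right\rangle
 + \left( \frac{\partial Z_1}{\partial X} \frac{\partial Z_2}{\partial Y} + \frac{\partial Z_1}{\partial Y} \frac{\partial Z_2}{\partial X} \right) \left\langle \Delta X\, \Delta Y \right\rangle
 + \frac{\partial Z_1}{\partial Y} \frac{\partial Z_2}{\partial Y} \left\langle (\Delta Y)^2 \right\rangle \\
&= 5Y\, \mathrm{Var}[X] + X\, \mathrm{Var}[Y] + (5X + Y)\, \mathrm{Cov}[X, Y] .
\end{aligned} \tag{5.84} $$

Since this expansion is made around the mean values, the variables $X$ and $Y$ 
appearing in the derivatives take the mean values $\mu_x$ and $\mu_y$. Since these 
values are known, the problem is solved. 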


Exercise 5.10 

The measured coordinates of a point in the x-y plane are considered to be 
random variables with standard deviations equal to 0.2 cm (for $x$) and 0.4 cm 
(for $y$). These variables are uncorrelated. Determine the covariance matrix in 
polar coordinates at the point $(x, y) = (1, 1)$. 

Answer Since the coordinates are uncorrelated, the covariance matrix of the 
original variables is: 

$$ V_{xy} = \begin{pmatrix} 0.04 & 0 \\ 0 & 0.16 \end{pmatrix} . $$

The transformation to polar coordinates is given by: 

$$ r = \sqrt{x^2 + y^2} , \qquad \varphi = \arctan \frac{y}{x} . $$


The transport matrix of this transformation is: 

$$ T = \begin{pmatrix} x/r & y/r \\ -y/r^2 & x/r^2 \end{pmatrix} , $$

which, at the $(x, y) = (1, 1)$ point, corresponding in polar coordinates to 
$(r, \varphi) = (\sqrt{2}, \pi/4)$, becomes: 

$$ T = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/2 & 1/2 \end{pmatrix} . $$

The problem is now solved with Eq. (5.77): 

$$ V_{r,\varphi} = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/2 & 1/2 \end{pmatrix}
\begin{pmatrix} 0.04 & 0 \\ 0 & 0.16 \end{pmatrix}
\begin{pmatrix} 1/\sqrt{2} & -1/2 \\ 1/\sqrt{2} & 1/2 \end{pmatrix}
= \begin{pmatrix} 0.100 & 0.042 \\ 0.042 & 0.050 \end{pmatrix} . $$

The square roots of the diagonal elements of $V_{r,\varphi}$ give the standard 
deviations of the variables $r$, $\varphi$. Therefore, given the measurement: 

$$ x = 1 \pm 0.2 \ \mathrm{cm} , \qquad y = 1 \pm 0.4 \ \mathrm{cm} , $$

the transformation to polar coordinates provides the values: 

$$ r = \sqrt{2} \pm \sqrt{0.100} = 1.41 \pm 0.32 \ \mathrm{cm} , \qquad \varphi = \frac{\pi}{4} \pm \sqrt{0.050} = 0.78 \pm 0.22 \ \mathrm{rad} . $$

In addition, the non-zero off-diagonal elements of the matrix signify that the 
variable transformation has introduced a positive correlation between $r$ and 
$\varphi$. 

The problem can also be solved with the following R commands: 

> x=1; y=1; 
> r=sqrt(x^2+y^2); 
> Vxy <- matrix(c(0.04,0.,0.,0.16),byrow=T,ncol=2) 
> T <- matrix(c(x/r,y/r,-y/r^2,x/r^2),byrow=T,ncol=2) 
> TD <- t(T)   # TD is the transpose 
> Vrphi <- T %*% Vxy %*% TD   # %*% is the row/column multiplication 




Exercise 5.11 
Prove that the covariance between the variables $(I_i, I_j)$ of the multinomial 
distribution is given by Eq. (4.92). 

Answer In a multinomial distribution, there exists a correlation between the 
bin contents $\{I_i = n_i\}$: 

$$ f(\boldsymbol{n}) \equiv (n_1 + n_2 + \dots + n_k) = N . \tag{5.85} $$

Since $N$ is fixed, $\mathrm{Var}[N] = 0$. If one applies transformation (5.80) to 
Eq. (5.85), the result: 

$$ \mathrm{Var}[N] = \sum_{ij} (\partial_i f)\, \sigma_{ij}\, (\partial_j f) = \sum_{ij} \sigma_{ij} = \sum_i \sigma_i^2 + \sum_{i \neq j} \sigma_{ij} = 0 $$

is obtained. From the last equality, and from Eq. (4.91), one then 
gets: 

$$ \sum_{i \neq j} \sigma_{ij} = - \sum_i \sigma_i^2 = - \sum_i N p_i (1 - p_i) = - \sum_{i \neq j} N p_i p_j , $$

where the condition: 

$$ (1 - p_i) = \sum_{j \neq i} p_j $$

has been taken into account. The result of Eq. (4.92) is thus derived: 

$$ \mathrm{Cov}[I_i, I_j] = \sigma_{ij} = -N p_i p_j . \tag{5.86} $$
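
A simulation gives a quick cross-check of Eq. (5.86). The following R sketch uses the base function `rmultinom()`; the total `N`, the bin probabilities `p`, the number of simulated experiments and the seed are assumed values chosen only for illustration.

```r
set.seed(2)                       # arbitrary seed, for reproducibility
N <- 100                          # assumed total number of events
p <- c(0.2, 0.3, 0.5)             # assumed bin probabilities (they sum to 1)

# Expected covariance matrix: N p_i (1 - p_i) on the diagonal, -N p_i p_j elsewhere
V_exact <- N * (diag(p) - outer(p, p))

# Each column of rmultinom() is one experiment; transpose so that rows are experiments
counts <- t(rmultinom(1e5, size = N, prob = p))
V_sim  <- cov(counts)

print(round(V_exact, 2))
print(round(V_sim, 2))
print(sum(V_exact))               # Var[N]: all the elements add up to zero
```

The last line reproduces, up to rounding, the constraint $\mathrm{Var}[N] = 0$ used in the proof.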


Finally, remember that all the limitations of the linear approximation discussed 
above apply to the results of this section. 


5.6 Problems 


5.1 Find the p.d.f. of the variable $Y = -2 \ln X$ where $X \sim U(0, 1)$ is uniform. 


5.2 Find the p.d.f. of $Z = X^2$, where the density of $X$ is $p_X(x) = 2(1 - x)$, $0 \le x \le 1$. 


5.3 Find the density $p_Z(z)$ of the variable $Z = X/Y$ from the known joint density 
$p_{XY}(x, y)$. 


5.4 The densities of the independent variables $X$ and $Y$ are $p_X(x) = \exp[-x]$, $x \ge 0$ 
and $p_Y(y) = \exp[-y]$, $y \ge 0$, respectively. Determine the density of $Z = X + Y$. 


5.5 Find the density of $X = \sum_{i=1}^{n} T_i$, where the variables $T_i$ are independent 
random times with negative exponential p.d.f. 


5.6 The independent variables $X$ and $Y$ have densities $p_X(x) = \exp[-x]$, $x \ge 0$ 
and $p_Y(y) = \exp[-y]$, $y \ge 0$, respectively. Determine the density $p_{ZW}(z, w)$ of 
the variables $Z = X/(X + Y)$, $W = X + Y$. Try to comment on the result. 


5.7 Find the density of $Z = XY$, where $X$ and $Y$ are two independent uniform 
variables $\sim U(0, 1)$, and calculate $\langle Z \rangle$ and $\mathrm{Var}[Z]$. 


5.8 Two devices $T_1$ and $T_2$, both having a mean life $1/\lambda$, work in parallel. The 
second device comes into operation only after the failure of the first. If the operating 
time of the two devices follows the exponential law, find the p.d.f. of the operating 
time $T$ of the system and its mean life. 


5.9 Two random variables $Z_1 = 3X + 2Y$ and $Z_2 = XY$ are given, where $X$ 
and $Y$ are independent random variables. Find the mean, variance, covariance and 
correlation of $Z_1$ and $Z_2$ when (a) $X$ and $Y$ are standardized and when (b) $X$ and $Y$ 
have unit mean and variance. 


5.10 A company produces both shafts, whose diameter is a Gaussian variable with 
parameters $\langle X \rangle = 5.450$ and $\sigma[X] = 0.020$ mm, and bearings, whose internal 
diameter is also a Gaussian variable with parameters $\langle Y \rangle = 5.550$ and $\sigma[Y] = 0.020$ 
mm. The shaft must be seated within the bearing and the coupling is acceptable 
when the shaft/bearing clearance is between 0.050 and 0.150 mm. Determine the 
percentage of discarded assemblies. 


5.11 On average, $\mu$ vehicles transit from A to B in a given time unit, and $\lambda$ vehicles 
do the same in the opposite sense, from B to A. Find the p.d.f. of the total number $N$ 
of vehicles and the probability to observe $k$ vehicles from A to B over a total of $n$. 


5.12 Verify the numerical values obtained in Exercise 5.8 with simulated data. 


Chapter 6 
Basic Statistics: Parameter Estimation 


In which Alinardo seems to give valuable information, and 
William reveals his method of arriving at a probable truth 
through a series of unquestionable errors. 


Umberto Eco, “THE NAME OF THE ROSE”. 


6.1 Introduction 


In the previous chapters, we have introduced the main results of probability theory. 

Let us now enter the fascinating world of statistics by starting with the fundamental 
question: what is statistics and how does it differ from probability theory? A 
first answer can be obtained by carefully considering the following two points: 


• A probability problem: if we attribute to a coin a true probability equal to 1/2 
of getting heads in a flip, what is the probability of getting less than 450 heads in 
1000 flips? This problem has been solved in Exercise 3.13. 

• The same problem in statistics: if 450 heads are obtained in 1000 coin flips, what 
is the estimate that can be given of the true probability of getting heads, that is, 
the one that would be obtained in an infinite number of flips? 


As can be seen, in the probabilistic approach, a model distribution is assumed to 
be the true one describing the studied process. Then the probability of obtaining 
a certain experimental result is estimated on the basis of this premise. In the 
statistical approach, instead, starting from the experimental value, an interval must 
be evaluated to determine the true value of the probability. 

At this point we realize that this estimate lacks an essential ingredient: the 
statistical equivalent of the standard deviation. Here we anticipate an approximate 
result that will be discussed in the next sections: often in statistics the estimation 
of the standard deviations can be performed by substituting the true parameters 
with the measured ones: $\sigma \simeq s$. This procedure is sometimes called error plug-in. 
The estimated standard deviation $s$ thus defined is often called statistical error by 
physicists and engineers (and generally by all who regularly perform laboratory 
measurements). At an international level [fSI93], the recommended term for 


© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 
A. Rotondi et al., Probability, Statistics and Simulation, La Matematica per il 3+2 
139, https://doi.org/10.1007/978-3-031-09429-3_6 




Table 6.1 Difference between probability theory and statistics in the simple case of x = 450 
successes in n = 1000 coin tosses 

Probability theory                        | Statistics 
------------------------------------------|------------------------------------------ 
Probability of spectrum values            | Parameter estimate ($p$) 
True probability: $p = 0.5$               | Frequency: $f = x/n = 0.45$ 
Expected value: $\langle X \rangle = 500$ | Measured value: $x = 450$ 
Standard deviation:                       | Statistical error or uncertainty: 
$\sigma[X] = \sqrt{np(1-p)} = 15.8$       | $s = \sqrt{nf(1-f)} = 15.7$ 


measurement results is statistical uncertainty. We therefore have three synonyms: 
estimated standard deviation (mathematical term), statistical error (language of 
physicists) and statistical uncertainty (term recommended internationally). In the 
following, we will mainly use statistical error. 

Using the formulae of Exercise 3.13, we get Table 6.1, which provides the 
intervals: 

$$ \begin{aligned}
\mu \pm \sigma &= 500.0 \pm 15.8 \simeq 500 \pm 16 = [484, 516] \quad \text{(probability theory)} , \\
x \pm s &= 450.0 \pm 15.7 \simeq 450 \pm 16 = [434, 466] \quad \text{(statistics)} .
\end{aligned} $$

Despite the apparent analogy, these two intervals have a very different meaning: the 
first, assuming $\mu$ and $\sigma$ to be known, assigns a probability to a set of values of $X$, 
while the second provides an estimate for the value of $\mu$. 
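
The entries of Table 6.1 and the two intervals can be reproduced with a few lines of R. This is only a sketch of the arithmetic; the variable names are arbitrary.

```r
n <- 1000; p <- 0.5; x <- 450     # values of Table 6.1

# Probability theory: the parameters are known
mu    <- n * p
sigma <- sqrt(n * p * (1 - p))

# Statistics: the parameters are replaced by their measured values (plug-in)
f <- x / n
s <- sqrt(n * f * (1 - f))

print(round(c(mu - sigma, mu + sigma), 1))   # interval for the values of X
print(round(c(x - s, x + s), 1))             # interval estimating mu
```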

The above example refers to the estimation of a true unknown parameter starting 
from the data. As we will see, this operation is performed using consistent 
estimators, defined in Eq. (2.77) as random variables $T_N(X)$. They are functions 
of a random sample of size $N$ and converge in probability to a given value. We 
remind the reader of Definition 2.12 of a random sample. 

Hypothesis testing, the other field of application of statistics, tries instead to 
answer questions like the following: if the experiment consisting of a thousand 
coin tosses is repeated twice and 450 and 600 heads are obtained, how likely 
is it that the same coin was used in both experiments? In the next chapter, this 
topic is described in the simplest case, of rejection or acceptance of a single initial 
hypothesis. Further on, in Chap. 10, we will explain how to optimize the choice 
among several alternative hypotheses. The concepts so far exemplified with the 
coin toss can be precisely defined by using, with a slightly different notation, the 
probability space of Eq. (1.11): 


$$ \mathcal{E}(\theta) = (S, \mathcal{F}, P_\theta) , \tag{6.1} $$

where the probability $P_\theta$ depends on a parameter $\theta$. Then, the random sampling 
$(X_1, X_2, \dots, X_n)$ follows the law: 

$$ P\{X \in A\} = \int_A p(x; \theta)\, dx . \tag{6.2} $$
A 




In the case of a coin toss, the probability is discrete and $p(x; \theta) = b(x; 1000, p)$, 
with $\theta = p$. Therefore, we will look for an estimate of $\theta$, and we will see how to 
perform the hypothesis test on the $\theta$ parameter, formalizing what has been intuitively 
explained in Exercises 3.13-3.17. 


6.2 Confidence Intervals 


In the introductory example of Table 6.1, we have intuitively defined an interval 
of size $2s$ to estimate, on the basis of the result of a thousand tosses, the expected 
number of heads as: 

$$ nf \pm \sqrt{nf(1 - f)} \simeq 450 \pm 16 . $$

Now imagine repeating the same experiment to obtain a new interval, which, before 
performing the coin tosses, is clearly a random interval that we could define as: 

$$ X \pm S . $$


In statistics, the probability for this interval to include the true probability $p$ of 
getting heads is called confidence level and denoted by $CL \equiv (1 - \alpha)$, while $\alpha$ is 
called significance level. In the ideal case, a satisfactory interval should have both 
a high confidence level and a small width. To meet these requirements, let us try to 
define a criterion for choosing an interval with the desired $CL$ following the so-called 
frequentist interpretation, adopted in the experimental sciences on the basis 
of a famous work published by the statistician J. Neyman in 1937 [Ney37]. 

Let us consider the density $p(x; \theta)$ of known functional form, and suppose we 
want to determine the unknown value of the parameter $\theta$. If $\theta$ is a position parameter 
such as the mean, for different values of $\theta$, the density will shift along the $x$ axis, as 
shown in Fig. 6.1. The unshaded areas of the densities in Fig. 6.1 correspond to the 
probability levels: 

$$ P\{x_1 \le X \le x_2; \theta\} = 1 - \alpha = \int_{x_1}^{x_2} p(x; \theta)\, dx , \tag{6.3} $$


Xx] 


where $\alpha$ is the sum of the areas of the two shaded tails. The union of all the intervals 
$[x_1, x_2]$ of Fig. 6.1 creates a region, in the $(x, \theta)$ plane, called confidence band with 
confidence level $CL = 1 - \alpha$. This band, looking at Fig. 6.1 from above, shows up as 
in Fig. 6.2. The two curves delimiting it are monotonically increasing functions $\theta_1(x)$ and 
$\theta_2(x)$. In general, this property is not true, but it can be restored by reparametrizing 
the problem (e.g., by using, as parameter $\theta$, the mean instead of the probability per 
unit of time when dealing with negative exponential distributions). 

Always keeping in mind Figs. 6.1 and 6.2, suppose now that we have measured $X$ and 
obtained a value $x$. By tracing the line passing through this point and parallel to the 



Fig. 6.1 Confidence interval of level $CL$ for the central parameter $\theta$. Unshaded light areas of 
$p(x; \theta)$ are equal to $CL$. A variation of the $\theta$ value changes the position of $p(x; \theta)$ along the axis 
of the measured values $x$. Taking into account the variation of the $\theta$ parameter along the vertical 
axis, one obtains the displayed pattern, where the horizontal width of the Neyman confidence band 
$[\theta_1(x), \theta_2(x)]$ corresponds to an area equal to $CL$ under $p(x; \theta)$. When an experimental value $x$ is 
obtained, the points of intersection between the confidence band and the line passing through $x$ and 
parallel to the $\theta$ axis determine the interval $[\theta_1, \theta_2]$ containing the true value of $\theta$ with a confidence 
level $CL = 1 - \alpha$, where $\alpha = c_1 + c_2$ is the sum of the two shaded areas 


Fig. 6.2 Looking at Fig. 6.1 from above, Neyman's confidence band for a fixed $CL$ shows up, 
which allows one to determine the confidence interval $\theta \in [\theta_1, \theta_2]$ starting from a measured value $x$ 




parameter (vertical) axis, the intersection interval $[\theta_1, \theta_2]$ with the confidence band 
is obtained. This is the required confidence interval. We note that this is a random 
interval, since the measured value $x$ varies with each observation. Therefore, we 
will denote it as $[\Theta_1, \Theta_2]$. If the true value of the unknown parameter is $\theta$, the 
previous figures show that this procedure leads to the interval $[x_1, x_2]$ on the axis 
of the measured values. By construction, we then have $P\{x_1 \le X \le x_2\} = CL$. 
Given that when $x = x_1$ one has $\theta = \theta_2$ on the parameter axis, and when $x = x_2$ the 
condition $\theta = \theta_1$ holds, we have $x \in [x_1, x_2]$ if and only if $\theta \in [\theta_1, \theta_2]$. From 
these considerations, we finally arrive at the fundamental property of the Neyman 
confidence interval $[\Theta_1, \Theta_2]$ with confidence level $CL = 1 - \alpha$: 

$$ P\{x_1 \le X \le x_2\} = P\{\Theta_1 \le \theta \le \Theta_2\} = CL . \tag{6.4} $$

We can summarize the previous discussion with the following definition. 

Definition 6.1 (Confidence Interval) Given two statistics $\Theta_1$ and $\Theta_2$ (in the sense 
of Definition 2.13) with $\Theta_1$ and $\Theta_2$ continuous variables and $\Theta_1 \le \Theta_2$ with 
probability 1, $I = [\Theta_1, \Theta_2]$ is called a confidence interval for a parameter $\theta$, of 
confidence level $0 < CL < 1$, if, for each $\theta$ belonging to the parameter space, the 
probability that $I$ contains $\theta$ is $CL$: 

$$ P\{\Theta_1 \le \theta \le \Theta_2\} = CL . \tag{6.5} $$

If $\Theta_1$ and $\Theta_2$ are discrete variables, the confidence interval is the smallest interval 
satisfying the condition: 

$$ P\{\Theta_1 \le \theta \le \Theta_2\} \ge CL . \tag{6.6} $$


To better highlight the concept of interval “covering” the parameter $\theta$ with a certain 
probability, the confidence level is associated with the terms “coverage” or coverage 
probability [CB90]. The condition (6.6) is said to be minimum over-coverage. The 
confidence level therefore coincides with the coverage only for continuous variables, 
and the equal sign in Eq. (6.6) refers to this situation. On the other hand, for discrete 
variables, the minimum interval ensuring a coverage greater than the requested one 
must be determined. 

Another important feature to be noticed is that the extremes of the confidence 
interval (6.5) are random variables, while the $\theta$ parameter is fixed. Consequently, 
the confidence level refers to the interval $I = [\Theta_1, \Theta_2]$ and indicates the fraction 
of experiments that correctly include the true value, in an infinite set of repeated 
experiments, each of which finds a different confidence interval. This is equivalent 
to stating that each particular interval $[\theta_1, \theta_2]$ is obtained with a method that gives 
the correct result in a fraction $CL$ of the performed experiments. 
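
This reading of the confidence level as a fraction of successful experiments can be illustrated by simulation. The R sketch below repeats the thousand-toss experiment of Table 6.1 many times and counts how often the random interval $X \pm S$ contains the true expected number of heads; the number of repetitions and the seed are arbitrary assumptions.

```r
set.seed(3)                       # arbitrary seed, for reproducibility
n <- 1000; p <- 0.5               # coin of Table 6.1
n_exp <- 1e5                      # number of simulated experiments (assumed)

x <- rbinom(n_exp, n, p)          # heads observed in each experiment
f <- x / n
s <- sqrt(n * f * (1 - f))        # plug-in statistical error of each experiment

covered  <- abs(x - n * p) <= s   # does [x - s, x + s] contain n p ?
coverage <- mean(covered)
print(coverage)
```

The printed fraction estimates the confidence level of the $\pm s$ interval. It can be compared with the 68.3% of a Gaussian $\pm 1\sigma$ interval, from which it differs slightly because the binomial variable is discrete.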

This frequentist interpretation of the statistical results has a clear operational 
meaning, very close to what practically happens in laboratory measurements and 
generally in repeated experiments, and is prevalent in applied sciences. In statistics, 




it is legitimate to denote as confidence interval both the random interval $I = 
[\Theta_1, \Theta_2]$ and its numerical realizations $[\theta_1, \theta_2]$. It is sometimes stated that $CL$ is 
the probability that the true value of $\theta$ is contained in $[\theta_1, \theta_2]$, but one must always 
be aware that, in the frequentist interpretation, this phrasing is incorrect, because $\theta$ 
is not a random variable [Cou95]. 

Let us now assume that $\theta$ is a position parameter such as the mean. If a value $x$ 
of a continuous variable $X$ has been obtained and the density $p(x; \theta)$ of Eq. (6.2) is 
known, the values $[\theta_1, \theta_2]$ of the random interval $[\Theta_1, \Theta_2]$, for a given $CL$, can be 
evaluated with the relations (pay attention to the position of $\theta_1$ and $\theta_2$): 

$$ \int_{x}^{\infty} p(z; \theta_1)\, dz = c_1 , \qquad \int_{-\infty}^{x} p(z; \theta_2)\, dz = c_2 , \tag{6.7} $$

where $CL = 1 - c_1 - c_2$. The procedure is also shown in Fig. 6.1. If $X$ is a discrete 
variable, the integrals must be replaced by the sums over the corresponding spectrum 
values, as we will shortly see. If a symmetric interval is chosen, then the conditions 
$c_1 = c_2 = (1 - CL)/2$ are valid. The choice of a symmetric interval is obviously not 
the only possible one, but it is the most common, since it gives the minimum width 
interval for a symmetric and bell-shaped p.d.f. Sometimes one wants to determine, 
for a certain $CL$, only the upper limit of $\theta$, i.e. the interval $(-\infty, \theta_U]$; in this case, 
only the second of Eqs. (6.7) is used, where $\theta_2 = \theta_U$ and $c_2 = 1 - CL$. For the 
lower bound, i.e. for the interval $[\theta_L, +\infty)$, the first of Eqs. (6.7) must be used with 
the conditions $\theta_1 = \theta_L$ and $c_1 = 1 - CL$. The three main types of estimate just 
described are displayed in Fig. 6.3. 
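
As an illustration of how Eqs. (6.7) can be solved numerically, the R sketch below assumes a Gaussian density $p(z; \theta)$ with mean $\theta$ and known standard deviation, and finds $\theta_1$ and $\theta_2$ with the base root finder `uniroot()`. The measured value `x`, the value of `sigma`, the confidence level and the search ranges of ten standard deviations are assumptions made for the example.

```r
x <- 10; sigma <- 2; CL <- 0.95   # assumed measurement, known sigma, confidence level
c1 <- c2 <- (1 - CL) / 2          # symmetric interval

# First of Eqs. (6.7): upper tail above x, for a density centred at theta1, equal to c1
eq_theta1 <- function(theta) pnorm(x, theta, sigma, lower.tail = FALSE) - c1
# Second of Eqs. (6.7): lower tail below x, for a density centred at theta2, equal to c2
eq_theta2 <- function(theta) pnorm(x, theta, sigma) - c2

theta1 <- uniroot(eq_theta1, c(x - 10 * sigma, x), tol = 1e-10)$root
theta2 <- uniroot(eq_theta2, c(x, x + 10 * sigma), tol = 1e-10)$root

print(c(theta1, theta2))
print(x + c(-1, 1) * qnorm(1 - c1) * sigma)   # closed form, for comparison
```

In this Gaussian case the numerical roots coincide with the closed form $x \pm t\sigma$, with $t$ the Gaussian quantile of order $1 - c_1$; for an upper limit only, one would solve the second equation alone with $c_2 = 1 - CL$.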

The choice of the interval type is usually determined by the nature of the specific
problem being analysed. It is however important to keep in mind that this choice must be
made before performing the measurement. Deciding, for example, in the case of
rare events, whether to quote a symmetric interval or an upper limit depending on
whether the measurement provides results or not (a technique called "flip-flop")
leads to an incorrect determination of the levels associated with the confidence
intervals [FC98].

Notice that Eq. (6.7) has a general validity since, quite often, the evaluation of a
confidence interval requires a consistent estimator $T_N$ satisfying Eq. (2.77) (where
$\mu = \theta$). The density of $T_N$ usually depends on the parameter to be estimated and
can be denoted as $p(t; \theta)$, with $\theta$ considered as a position parameter. Indeed,
$\langle T_N \rangle = \int t\, p(t; \theta)\, dt = \tau(\theta)$, with $\tau(\theta) = \theta$
for unbiased estimators (these concepts will be explored further on, in Chap. 10). If this
density is found, the estimate of $\theta$ can be performed through Eq. (6.7).


6.3 Confidence Intervals with Pivotal Variables 


So far, we have described two ways to determine the confidence interval: the
simple graphical method of Fig. 6.1, which however requires the construction of
the Neyman confidence band, and the calculation of the integrals (6.7). In general,
both of them are computationally demanding. Fortunately, if the shape of the
density function has some invariance properties with respect to the parameters to
be estimated, a particularly simple method can be used. Consider the case where
the parameter $\theta$ of Fig. 6.1 is the mean of a Gaussian, $\theta = \mu$. Then, the shape
of $p(x; \mu)$ is invariant by translation, and the functions $\theta_1(x)$ and $\theta_2(x)$ of
Fig. 6.2 are two parallel straight lines. From Fig. 6.4 the following property results:

Fig. 6.3 Determination, using the Neyman method, of the bilateral (two-tailed) confidence
interval given by Eqs. (6.7) (a), of the upper bound $\theta_U$ (b) and of the lower bound
$\theta_L$ (c) at a measured value $x$. The sum of the areas of the two shaded tails in (a) is
$1 - CL$, while each tail in (b) and (c) has a value $1 - CL$. Therefore, using the same
confidence level, one has the situation shown in the figure, with $\theta_2 > \theta_U$ and
$\theta_1 < \theta_L$


$$
\int_{\mu - t\sigma}^{\mu + t\sigma} p(x; \mu)\, dx =
\int_{x - t\sigma}^{x + t\sigma} p(\mu; x)\, d\mu , \tag{6.8}
$$


that can be written as: 


$$
P\{\mu - t\sigma \le X \le \mu + t\sigma\} = P\{-t\sigma \le X - \mu \le t\sigma\}
= P\{X - t\sigma \le \mu \le X + t\sigma\} . \tag{6.9}
$$


206 6 Basic Statistics: Parameter Estimation 


Fig. 6.4 When the shape of the density is invariant for translation of the $\mu$
parameter, the confidence interval can be determined with the simple formula (6.9)


In other words, the probability levels of the interval centred about $\mu$ coincide
with the confidence levels of the random interval centred about $X$. The value of
$X$ changes at each measurement, but the coverage probability of $\mu$ is equal to that
of the interval $\mu \pm t\sigma$ (see Fig. 6.4). Using the cumulative function
$F(z; \theta)$ (when $\theta$ and $z$ are scalar quantities), this property can be written as:


$$
F(z; \theta) = 1 - F(\theta; z) . \tag{6.10}
$$


We therefore have found the following rule of thumb: in the Gaussian case, or in 
all cases when Eq. (6.10) holds, it is sufficient to centre on the measured value and 
assume, as confidence levels, the probability levels corresponding to the width of 
the interval centred on the mean. 

For example, we know, considering Table 6.1, that the number x of successes in 
n = 1000 coin flips is a Gaussian variable (because np, n(1 — p) > 10). Then, we 
can estimate the true or expected value of successes as: 


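$$
\begin{aligned}
x \pm \sqrt{n f (1 - f)} &= 450 \pm 16 && (CL = 68.3\%) \\
x \pm 2\sqrt{n f (1 - f)} &= 450 \pm 32 && (CL = 95.4\%) \\
x \pm 3\sqrt{n f (1 - f)} &= 450 \pm 48 && (CL = 99.7\%) .
\end{aligned}
$$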
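
The three intervals above can be reproduced with a few lines of R. This is only a
minimal sketch: the variable names are ours, and the value x = 450 is the one quoted
in the text for the coin-flip example.

```r
# Gaussian (plug-in) intervals for the expected number of successes
n <- 1000                    # number of coin flips
x <- 450                     # observed successes (value quoted in the text)
f <- x / n                   # observed frequency
s <- sqrt(n * f * (1 - f))   # estimated standard deviation of x

for (t in 1:3) {
  cl <- pnorm(t) - pnorm(-t)   # Gaussian probability content of +/- t sigma
  cat(sprintf("x +/- %d s = %d +/- %.1f   (CL = %.1f%%)\n",
              t, x, t * s, 100 * cl))
}
# The text rounds s to 16 before multiplying, hence 450 +/- 32 and 450 +/- 48.
```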


As we have just seen, the random variable $Q = (X - \mu)$ includes the parameter
$\mu$ but has a distribution $N(0, \sigma^2)$, independent of $\mu$. The random variables
whose distribution does not depend on the parameter to be estimated are called pivotal
quantities. Another example occurs when $\theta$ is a scale parameter, that is,
$p(x; \theta) = h(x/\theta)/\theta$; in this case $Q = X/\theta$ is a pivotal quantity for $\theta$.




Generalizing this argument, we can state that, if $Q(X, \theta)$ is a pivotal quantity, the
probability $P\{Q \in A\}$ does not depend on $\theta$ for each $A \subseteq \mathbb{R}$. If this
distribution has a known density (standard Gaussian, Student, $\chi^2$, ...), the quantiles
$q_1$ and $q_2$ can be easily determined at a given CL as:


$$
P\{q_1 \le Q(X, \theta) \le q_2\} = CL . \tag{6.11}
$$


If the condition $q_1 \le Q(X, \theta) \le q_2$ can be solved for $\theta$, one can write, as in
Eq. (6.9):


$$
P\{q_1 \le Q(X, \theta) \le q_2\} = P\{\theta_1(X) \le \theta \le \theta_2(X)\}
= P\{\Theta_1 \le \theta \le \Theta_2\} , \tag{6.12}
$$


obtaining Eq. (6.5). Therefore, if a pivotal quantity is found which is solvable with
respect to $\theta$, according to Eq. (6.12), the confidence interval can be determined
without resorting to integrals (6.7). This simple procedure is often the only one
reported in elementary texts.
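
The pivotal method can be illustrated with a small simulation for a scale parameter.
The sketch below assumes, purely for illustration, that X is exponential with mean
theta, so that Q = X/theta follows the unit exponential; the inequality
q1 <= X/theta <= q2 is then inverted into X/q2 <= theta <= X/q1. The names and the
chosen true value are ours.

```r
# Pivotal-quantity interval for a scale parameter (illustrative assumption:
# X exponential with mean theta, so Q = X/theta has density exp(-q))
set.seed(1)
theta_true <- 3
CL <- 0.90
q1 <- qexp((1 - CL) / 2)    # lower quantile of the pivotal quantity Q
q2 <- qexp((1 + CL) / 2)    # upper quantile of the pivotal quantity Q

# One measurement per simulated experiment
x <- rexp(20000, rate = 1 / theta_true)

# q1 <= X/theta <= q2  is equivalent to  X/q2 <= theta <= X/q1
theta_lo <- x / q2
theta_hi <- x / q1

coverage <- mean(theta_lo <= theta_true & theta_true <= theta_hi)
cat("empirical coverage:", coverage, "\n")
```

The empirical coverage should be close to the requested CL whatever value is chosen
for theta_true, which is exactly the property that makes Q pivotal.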


6.4 Mention of the Bayesian Approach 


We have just described the basic frequentist method for parameter estimation. 

In the Bayesian approach, briefly described in Sect. 1.3, the parameter to be
estimated is considered as a random variable and the confidence interval represents
the knowledge obtained, after the measurement, on the value of this parameter. Let
us again assume that 450 heads are obtained in a thousand coin flips. Under the
normal approximation and assigning a constant a priori probability to the expected
value, it turns out, after the experiment, that the expected value is $450 \pm 16$ with a
degree of credibility (probability or belief) of 68%.

In this case the Bayesian approach provides a numerical result equal to the
frequentist one but interpreted in a different way, since the Bayesian interval
depends on a priori information. In the case of a fair coin ($p = 0.5$), with a
Gaussian uncertainty on its balance of, say, $\sigma_p/p = 0.1\%$, we could replace the
uniform a priori distribution (constant probability) with the Gaussian distribution
$N(p, \sigma_p^2)$, obtaining a result that is numerically different from the frequentist one.
We will not elaborate more on these aspects that are treated in detail, from a
statistical point of view, in [CB90] and [Gre06].

Finally, we recall that Bayesian analyses have been proposed in physics when it 
is not easy to find pivotal quantities, as in the case of small counting experiments 
with background or of samples from Gaussian populations with physical constraints 
on the measured variables (e.g. if X is a mass, the a priori condition {X > 0} holds) 
[Cou95, D’A99]. However, even for these situations, a frequentist approach has been 
proposed [FC98], which does not require a priori assumptions on the parameter 
distribution and which has met with the favour of experimental physicists [JLPe00, 
LW18]. 


6.5 Some Notations 


In the following it will be important to keep in mind the notations used for point and
confidence interval estimations. The point estimation of a true value of statistical
parameters is obtained using estimator values; for example, the sample mean is a
possible estimate of the true mean. The notation we will use is: $m = \hat{\mu}$. We will
extensively describe point estimation in Chaps. 10 and 11, while in this chapter we
analyze in detail interval estimation. In the Gaussian case, the value $\theta$ is contained,
with confidence level $1 - \alpha = CL$, in a symmetric interval centred around $x$ when:


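$$
\theta \in [\, x - t_{1-\alpha/2}\, s,\; x + t_{1-\alpha/2}\, s \,] , \tag{6.13}
$$


where $s \simeq \sigma$ and $t_{\alpha/2} = -t_{1-\alpha/2}$ are the standard Gaussian quantiles.
It is easy to verify that the quantile indices can be written in terms of CL as:


$$
\frac{\alpha}{2} = \frac{1 - CL}{2} , \qquad 1 - \frac{\alpha}{2} = \frac{1 + CL}{2} . \tag{6.14}
$$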
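
As a quick numerical illustration of Eq. (6.14), the standard Gaussian quantiles can be
obtained in R with qnorm. The confidence levels listed below are arbitrary examples
chosen by us.

```r
# Quantile indices of Eq. (6.14) and the corresponding standard Gaussian quantiles
CL     <- c(0.683, 0.90, 0.95, 0.99)
alpha  <- 1 - CL
t_low  <- qnorm((1 - CL) / 2)   # t_{alpha/2}
t_high <- qnorm((1 + CL) / 2)   # t_{1 - alpha/2}
print(data.frame(CL, alpha, t_low, t_high))
```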
In statistics, the $1\sigma$ interval is often written as:

$$
\theta \in [x - s, x + s] \equiv x \pm s . \tag{6.15}
$$


The first notation is preferred by mathematicians, the second one by physicists and
engineers, who often replace the set membership symbol with that of equality:


$$
\theta \in x \pm s \;\xrightarrow{\ \text{physicists and engineers}\ }\; \theta = x \pm s . \tag{6.16}
$$


If the errors to the right and left of the central value are different, the notation of
mathematicians obviously does not change, while the other one becomes:


$$
\theta \in [x - s_1, x + s_2] \equiv x^{+s_2}_{-s_1} . \tag{6.17}
$$


It is usually considered improper to assign more than two significant digits to s. For 
example, if the first significant digit of the error s corresponds to a metre, it makes 
no sense to give the result with millimetre precision. The following rule of thumb 
applies, which we report here as: 


Statement 6.2 (Significant Digits of the Statistical Error) Final results of the 
type x + s must be presented with the uncertainty (statistical error) s given with 
no more than two significant digits and x rounded in the same way. Deviations from 
this rule must be justified. 


Notice that, in intermediate calculations, more digits can be used to reduce round-
off errors; however, in the final results, Statement 6.2 should always be followed.




Therefore, the following results are wrong:


$$
35.923 \pm 1.407 , \qquad 35.923 \pm 1.4 , \qquad 35.9 \pm 1.407 ,
$$


because the first result has too many significant digits, the second and the third ones
exhibit a mismatch between result and error. On the contrary, it is correct to write:


$$
35.9 \pm 1.4 \qquad \text{or} \qquad 36 \pm 1 .
$$
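
Statement 6.2 can be automated. The helper below, format_result, is a hypothetical
function written for this illustration (it is not one of the book's routines): it
rounds the error to the requested number of significant digits and then rounds the
central value to the same decimal place.

```r
# Format "x +/- s" keeping at most `digits` significant digits of s (Statement 6.2).
# format_result() is a hypothetical helper written for this example.
format_result <- function(x, s, digits = 2) {
  s_r <- signif(s, digits)                      # round the error first
  dec <- digits - 1 - floor(log10(abs(s_r)))    # decimal places implied by s_r
  x_r <- round(x, dec)                          # round x to the same place
  nd  <- max(dec, 0)                            # formatC needs digits >= 0
  paste(formatC(x_r, format = "f", digits = nd), "+/-",
        formatC(s_r, format = "f", digits = nd))
}

format_result(35.923, 1.407)               # "35.9 +/- 1.4"
format_result(35.923, 1.407, digits = 1)   # "36 +/- 1"
```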


6.6 Probability Estimation 


Here we consider the following problem: if a Bernoulli test with $n$ trials and $x$
successes is performed and a frequency $f = x/n$ is obtained, what is the estimate of
the true probability? From probability theory we know that, after $n$ trials on a system
that generates events with constant probability $p$, we can obtain a number $x$ of
successes (spectrum) between 0 and $n$. However, these results are not equiprobable
but are distributed according to the binomial density (2.29).

We should use Eq. (6.7) to solve this problem for a discrete binomial random
variable. If CL is the required confidence level, the values $p_1$ and $p_2$ of the
corresponding confidence interval are determined by using the two distributions of
Fig. 6.3 a) (where $\theta_1 = np_1$ and $\theta_2 = np_2$). These values can be determined with
the so-called Clopper-Pearson equations:


$$
\sum_{k=x}^{n} \binom{n}{k} p_1^k (1 - p_1)^{n-k} = c_1 , \tag{6.18}
$$


$$
\sum_{k=0}^{x} \binom{n}{k} p_2^k (1 - p_2)^{n-k} = c_2 . \tag{6.19}
$$


The presence of $x$ in both sums ensures the over-coverage condition of Eq. (6.6) for
discrete variables. One common choice is the symmetric interval, where $c_1 = c_2 =
(1 - CL)/2 = \alpha/2$. The solution of these two equations with respect to $p_1$ and $p_2$
gives the correct probability estimate from small samples.
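
Equations (6.18, 6.19) have no closed-form solution, but they are easy to solve
numerically. The sketch below uses uniroot on the binomial cumulative function; the
helper name clopper_pearson is ours, and the code assumes 0 < x < n so that both
equations have a root inside (0, 1). The built-in binom.test is used as a cross-check.

```r
# Solve the Clopper-Pearson equations (6.18, 6.19) numerically for 0 < x < n.
# clopper_pearson() is an illustrative helper, not one of the book's routines.
clopper_pearson <- function(x, n, CL = 0.95) {
  c1 <- c2 <- (1 - CL) / 2                       # symmetric interval
  # Eq. (6.18): P{X >= x; p1} = c1
  eq1 <- function(p) pbinom(x - 1, n, p, lower.tail = FALSE) - c1
  # Eq. (6.19): P{X <= x; p2} = c2
  eq2 <- function(p) pbinom(x, n, p) - c2
  p1 <- uniroot(eq1, c(0, 1), tol = 1e-12)$root
  p2 <- uniroot(eq2, c(0, 1), tol = 1e-12)$root
  c(p1 = p1, p2 = p2)
}

ci <- clopper_pearson(5, 20, CL = 0.90)
print(round(ci, 3))
# Cross-check with the exact interval computed by R
print(binom.test(5, 20, conf.level = 0.90)$conf.int)
```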

The general scheme shown in Fig. 6.3 applies to the determination of the upper 
and lower limits for a predefined CL. In the case of a discrete binomial distribution, 
it becomes the one shown in Fig. 6.5. 

When $x = n$ and $x = 0$, Eqs. (6.18, 6.19), with $c_1 = c_2 = 1 - CL$, give two
important limiting cases:


$$
x = n \implies p_1^n = 1 - CL , \tag{6.20}
$$
$$
x = 0 \implies (1 - p_2)^n = 1 - CL . \tag{6.21}
$$




Fig. 6.5 Probability estimation of lower and upper bounds for a predefined CL. In the limiting
cases where $x = 0$ and $x = n$, one has $(1 - p_2)^n = 1 - CL$ and $p_1^n = 1 - CL$, respectively


From these equations one obtains, for a fixed CL, the lower bound of a probability
when all attempts have been successful:


$$
p_1 = \sqrt[n]{1 - CL} = e^{\frac{1}{n}\ln(1 - CL)} = 10^{\frac{1}{n}\log(1 - CL)} , \tag{6.22}
$$
and the upper limit when no success has been recorded:
$$
p_2 = 1 - \sqrt[n]{1 - CL} = 1 - e^{\frac{1}{n}\ln(1 - CL)} = 1 - 10^{\frac{1}{n}\log(1 - CL)} , \tag{6.23}
$$


where the use of base-10 or base-e logarithms is useful for large $n$.

The frequentist interpretation of these limits is that, if the true value were greater
(less) than the upper (lower) limit, we would obtain values $\le$ ($\ge$) those
observed in a fraction of experiments $< 1 - CL$.

When $n$ is large and no successes have occurred, $p_2$ is small. Expanding to the
first order the exponential in Eq. (6.23) around the starting point $p_2 = 0$, we obtain
the approximation:


$$
p_2 \simeq -\frac{1}{n} \ln(1 - CL) , \tag{6.24}
$$


corresponding to the equation:


$$
e^{-n p_2} = (1 - CL) = \alpha , \tag{6.25}
$$




which gives the Poissonian probability of getting no events when the mean is
$\mu = np_2$.

The formulae just obtained are implemented in the R routine
binom.test(x, n, conf, alt), where x, n, conf and alt are the number of successes,
the number of trials, the required confidence level (default value conf = 0.95) and
the type of interval, respectively (conf and alt are accepted abbreviations of the
full argument names conf.level and alternative). For example,
binom.test(5, 20, conf=0.90) returns the values [0.104, 0.455] in the last four lines
of the output message. Further messages (not shown here) refer to the test with a
binomial having p = 0.5 and should be ignored for the moment. This routine normally
provides the Clopper-Pearson bilateral interval (6.18, 6.19), because the variable alt
is initialized as alt = "two.sided". To obtain the upper limit, one has to set
alt = "less", while to get the lower limit the command is alt = "greater". It
is instructive, for a given CL, to obtain the values for alt = "two.sided",
"less" and "greater" and check the situation described in Fig. 6.3.


Exercise 6.1 

From an urn containing five black and white marbles in unknown proportions, 
ten extractions are performed (with replacement), and ten black marbles are 
extracted. Find the lower limit of the number of black marbles in the urn for 
CL = 0.90. Compare the results with those of Exercise 1.6. 


Answer From Eq. (6.22), we get $p_1 = (0.10)^{1/10} = 0.794$. The lower
limit for the number of black marbles is $0.794 \cdot 5 = 3.97$. Therefore,
we can state that the urn contains at least four black marbles with
CL = 0.90. This result can be obtained also with the R command
binom.test(10, 10, conf=0.90, alt="greater"). An urn with
fewer than four black marbles can yield ten consecutive black
marbles, but this happens in less than 10% of the experiments.

It is interesting to compare this frequentist solution with the Bayesian 
result given in Table 1.2: the frequentist estimate is independent of any a priori 
subjective hypothesis about the initial marble content. Subjective hypotheses 
usually affect the final results, as shown by the results of Exercise 1.6 and 
Problem 1.12. 


Equations (6.22, 6.23) are important in many reliability problems, as the 
following examples show. 




Exercise 6.2 

An emergency pump undergoes a reliability test consisting of 500 "cold
starts". If the pump passes the test, what is the upper limit, at a confidence
level of 95%, on the probability that it will not start in an emergency situation?


Answer From Eq. (6.23), one immediately obtains the upper limit:
$$
p_2 = 1 - \sqrt[n]{1 - CL} = 1 - \sqrt[500]{0.05} = 0.00597 ,
$$


that is about 0.6%. The same result can be obtained with the R command
binom.test(0, 500, conf=0.95, alt="less"). Neyman's interpretation
of the result is as follows: a pump with a probability of failure greater
than 6 per thousand can start 500 consecutive times, but this happens in less
than 5% of tests. Note that this result holds true in the independent tests
scheme.


Exercise 6.3 
How many consecutive non-failure tests are required to affirm, with CL = 
95%, that a device will fail in less than 3% of times? 


Answer From Eq. (6.23) using decimal logarithms, one obtains:


$$
n = \frac{\log(1 - CL)}{\log(1 - p)} = \frac{\log(1 - 0.95)}{\log(1 - 0.03)} \simeq 98.4 , \tag{6.26}
$$


that is about 100 tests. A device with a failure probability > 3% can
successfully pass 100 tests, but this happens in less than 5% of the cases.
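
The numbers of Exercises 6.1, 6.2 and 6.3 can be checked directly in R. The lines
below are a sketch with variable and function names of our own choosing; the last
command compares the closed-form limit of Eq. (6.23) with the exact routine.

```r
# Limits of Eqs. (6.22, 6.23) when all (or none) of the n trials succeed
CL <- 0.95
n  <- 500
p_upper  <- 1 - (1 - CL)^(1 / n)    # Eq. (6.23), x = 0 (Exercise 6.2)
p_approx <- -log(1 - CL) / n        # Eq. (6.24), first-order approximation
p_lower  <- (1 - 0.90)^(1 / 10)     # Eq. (6.22), x = n = 10, CL = 0.90 (Exercise 6.1)

# Consecutive successful tests needed to claim a failure probability below p_max
n_tests <- function(p_max, CL) ceiling(log(1 - CL) / log(1 - p_max))

cat(p_upper, p_approx, p_lower, n_tests(0.03, 0.95), "\n")

# Cross-check of Exercise 6.2 with the exact routine (upper limit)
binom.test(0, 500, conf.level = 0.95, alternative = "less")$conf.int
```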


6.7 Probability Estimation from Large Samples 


The Clopper-Pearson formulae (6.18, 6.19), derived in the previous section, are
completely general and are valid for both small and large samples. However, they
are mathematically laborious to solve in the unknowns $p_1$ and $p_2$ and require the
use of the R software, so that often approximate formulae are used.

Indeed, we know that the binomial distribution, for $np, n(1 - p) > 10$, rapidly
assumes the Gaussian form of mean value $np$ and variance $\sigma^2 = np(1 - p)$. It
is therefore extremely important and useful to have simple formulae in Gaussian
approximation, which provide practically exact results for large samples.




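Consider the frequency $f = x/n$ as the occurrence of the random variable $F =
X/n$, which, from Eqs. (2.64, 3.4, 3.6), has mean and variance:


$$
\langle F \rangle = \frac{\langle X \rangle}{n} = \frac{np}{n} = p , \tag{6.27}
$$
$$
\mathrm{Var}[F] = \frac{\mathrm{Var}[X]}{n^2} = \frac{np(1 - p)}{n^2} = \frac{p(1 - p)}{n} , \tag{6.28}
$$
and define the standard variable (3.37):
$$
T = \frac{F - p}{\sigma[F]} , \quad \text{which assumes the values} \quad
t = \frac{f - p}{\sqrt{\dfrac{p(1 - p)}{n}}} . \tag{6.29}
$$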


Under the Gaussian approximation, $T$ can be considered pivotal, and we can thus
apply the method described in Sect. 6.3. Since $f$ is known and $p$ is unknown, using
the statistical approach, we can then determine the values of $p$ for which the value
assumed by the standard variable is less than a certain assigned quantile $t$:


$$
\frac{|f - p|}{\sqrt{\dfrac{p(1 - p)}{n}}} \le |t| . \tag{6.30}
$$


We eliminate the absolute values by squaring both sides and solve, with respect to
the unknown $p$, the resulting second degree inequality:


$$
\frac{(f - p)^2}{p(1 - p)/n} \le t^2 , \qquad (f - p)^2 \le t^2\, \frac{p(1 - p)}{n} ,
$$
$$
(t^2 + n)\, p^2 - (t^2 + 2fn)\, p + n f^2 \le 0 .
$$


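Since the coefficient of the $p^2$ term is always positive, the inequality is satisfied
for values of $p$ in the range:


$$
p \in \frac{(t^2 + 2fn) \pm \sqrt{t^4 + 4f^2n^2 + 4t^2nf - 4t^2nf^2 - 4n^2f^2}}{2(t^2 + n)} ,
$$


from which, in a compact form, one obtains the Wilson formula:


$$
p \in \frac{f + \dfrac{t^2}{2n}}{\dfrac{t^2}{n} + 1} \pm
\frac{t \sqrt{\dfrac{t^2}{4n^2} + \dfrac{f(1 - f)}{n}}}{\dfrac{t^2}{n} + 1} . \tag{6.31}
$$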


The $t$ parameter indicates any value of the standard variable $T$; the value $t = 1$
corresponds to one standard deviation. Note that the interval is not centred on the
measured frequency $f$ but at the value $(f + t^2/2n)/(t^2/n + 1)$, which is a function



of $f$ and of the number of trials performed. This effect is a consequence of the
binomial density asymmetry for small $n$, as seen in Fig. 2.4. For $n \gg 1$ the interval
tends to be centred around the measured frequency and Eq. (6.31), for $CL = 1 - \alpha$,
becomes:

$$
p \in f \pm |t_{\alpha/2}|\, s = f \pm |t_{\alpha/2}| \sqrt{\frac{f(1 - f)}{n}} , \tag{6.32}
$$


where $|t_{\alpha/2}| = t_{1-\alpha/2}$ are the standard Gaussian quantile values corresponding
to the extremes of the interval with an area under the curve of $CL = 1 - \alpha$.
Usually the $1\sigma$ interval is reported with $t_{1-\alpha/2} = 1$:


$$
p \in f \pm s = f \pm \sqrt{\frac{f(1 - f)}{n}} , \tag{6.33}
$$


which is named Wald interval. This interval can also be obtained directly from
Eq. (6.30) by replacing in the denominator the true error with the estimated one
$\sqrt{f(1 - f)/n}$, a technique sometimes called error plug-in. By multiplying by the
number of trials $n$, Eq. (6.33) can easily be expressed as a function of the number
of successes $x$. The obtained interval is then related to the expected number of
successes $\mu$:


$$
\mu \in x \pm \sqrt{x \left(1 - \frac{x}{n}\right)} . \tag{6.34}
$$


This formula, which is used in practice when $nf, n(1 - f) > 20$-$30$, has been
used in the introductory example of Table 6.1. It is easy to remember, because the
interval is centred at the measured value, the variable $T \sim N(0, 1)$ of Eq. (6.29)
is pivotal, Eq. (6.9) holds, the statistical error is the same as the standard deviation
of the binomial distribution (with the probability $p$ replaced by the frequency $f$)
and the confidence levels CL are Gaussian. Finally, we note that the accuracy of
Wilson's formula (6.31) can be improved by applying the continuity correction to
the frequency $f = x/n$, which generally improves the coverage of the confidence
intervals when the variable is discrete:

$$
f_\pm = \frac{x \pm 0.5}{n} . \tag{6.35}
$$


In the Gaussian approximation (when $|t_{\alpha/2}| = t_{1-\alpha/2}$), one obtains the following
interval estimation:


$$
p \in [\max(0, p_-);\ \min(1, p_+)] , \tag{6.36}
$$




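with


$$
p_\pm = \frac{f_\pm + \dfrac{t_{\alpha/2}^2}{2n} \pm |t_{\alpha/2}|
\sqrt{\dfrac{t_{\alpha/2}^2}{4n^2} + \dfrac{f_\pm (1 - f_\pm)}{n}}}{\dfrac{t_{\alpha/2}^2}{n} + 1} , \tag{6.37}
$$

where $f_\pm$ are the continuity-corrected frequencies of Eq. (6.35).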


This equation provides a good over-coverage and gives an interval that is, on average,
narrower than the one obtained from Eqs. (6.18, 6.19) [Rot10].

The coverage properties of Eqs. (6.18, 6.19, 6.37), as a function of sample size
and confidence levels, will be explored later in the context of simulation techniques,
in Sect. 8.11, whose reading is strongly recommended to those interested in probability
estimation.

In general, we can state that Eq. (6.37) gives correct results for $np, n(1 - p) > 10$,
whereas Eq. (6.32) should be used when $np > 100$. Equations (6.31, 6.37) are
implemented inside the R routine prop.test(x, n, alt, conf, corr), where
x and n are the successes and the trials, respectively, alt is the type of estimate,
that is "two.sided" (default), "less" or "greater", whereas conf (default
= 0.95) is the confidence level. Finally, corr (default = TRUE) indicates whether
or not the continuity correction is applied. This routine also prints messages related
to a hypothesis test with p = 0.5 that should be ignored in this context.
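
To make the structure of the Wilson interval explicit, the sketch below codes
Eq. (6.31) by hand, assuming that the continuity-corrected limits are obtained by
inserting the frequencies (x - 0.5)/n and (x + 0.5)/n in the lower and upper limit,
respectively. The helper wilson_ci and its argument names are ours; the result is
compared with prop.test, and the code is meant for 0 < x < n.

```r
# Wilson interval, Eq. (6.31), with optional continuity correction.
# wilson_ci() is an illustrative helper; its argument names are our own choice.
wilson_ci <- function(x, n, CL = 0.95, correct = TRUE) {
  t <- qnorm((1 + CL) / 2)               # t_{1 - alpha/2}
  d <- if (correct) 0.5 else 0           # continuity correction on x
  bound <- function(f, sign) {
    (f + t^2 / (2 * n) + sign * t * sqrt(t^2 / (4 * n^2) + f * (1 - f) / n)) /
      (t^2 / n + 1)
  }
  p_minus <- bound((x - d) / n, -1)      # lower limit, from the lowered frequency
  p_plus  <- bound((x + d) / n, +1)      # upper limit, from the raised frequency
  c(max(0, p_minus), min(1, p_plus))     # clipping to [0, 1]
}

print(wilson_ci(600, 3000, CL = 0.954))
print(prop.test(600, 3000, conf.level = 0.954)$conf.int)
```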


Exercise 6.4 

During a projection of the election results, 3000 ballots were randomly 
sampled from the total population of voting cards and examined. The A party 
got 600 votes. Give the final forecast (projection) of the results. 


Answer Since $n = 3000$, $f = 600/3000 = 0.20$ and $nf, n(1 - f) > 10$,
Eq. (6.33) can be used, with Gaussian confidence levels. Therefore, one
obtains:


$$
\begin{aligned}
p &\in f \pm \sqrt{\frac{f(1 - f)}{n}} = [0.193, 0.207] = (20.0 \pm 0.7)\% && CL = 68.3\% , \\
p &\in f \pm 2\sqrt{\frac{f(1 - f)}{n}} = [0.186, 0.214] = (20.0 \pm 1.4)\% && CL = 95.4\% , \\
p &\in f \pm 3\sqrt{\frac{f(1 - f)}{n}} = [0.179, 0.221] = (20.0 \pm 2.1)\% && CL = 99.7\% .
\end{aligned}
$$


The R routine prop.test can be used with the following lines of code:


prop.test(600, 3000, conf=0.683)
prop.test(600, 3000, conf=0.954)
prop.test(600, 3000, conf=0.997)


which give three estimates very close to those previously found with the
approximate formula.

Notice the surprising precision obtained even with a limited number of
voting cards. Strictly speaking, since the voter population is very large but
finite (millions of citizens), a correction should be made to these results, as
shown in Sect. 6.13, but here it is absolutely negligible. In this type
of prediction, the real difficulty lies in obtaining a truly representative sample
of the total population. In general, a sample is defined as representative or
random when a single individual from any group has a probability of being
chosen proportional to the group's size in the total population (see also
Definition 6.4 below). In samples from natural or physical phenomena, such
as those obtained in a physics laboratory, nature itself provides a random
sample, if no mistakes or systematic errors are made during the measurements
(see also the discussion in Chap. 12). However, the situation is very different
in social or biological sciences, where the methods of sampling from a
population are so important and difficult that they form a special branch of
statistics. Those interested in these techniques can consult [Coc77].


Equation (6.32) also allows the determination of the sample size necessary to
keep the statistical error below an a priori fixed value. This result is easily reached if
we square the statistical error present in the equation and exchange $f$ with the true
value $p$, obtaining the variance:


$$
\sigma^2 = \frac{p(1 - p)}{n} . \tag{6.38}
$$


Incidentally, we note that this equation is identical to (3.5), except for the division
by the factor $n^2$, since here we consider the variable $F = X/n$. If we now set to
zero the derivative with respect to $p$, we get:


$$
\frac{d}{dp}\, \frac{p(1 - p)}{n} = \frac{1 - 2p}{n} = 0 \implies p = \frac{1}{2} .
$$



Substituting the value $p = 1/2$, at which the variance is maximum, into Eq. (6.38),
we obtain the upper bound for the true variance, as a function of the number of trials $n$:


$$
\sigma^2_{max} = \frac{1}{4n} . \tag{6.39}
$$


This formula has the remarkable property of giving an upper bound for the variance
regardless of the value of the true probability $p$. It is therefore possible to determine
a universal formula for the number of trials required to keep the interval estimate
$\pm t_{1-\alpha/2}\, \sigma_{max}$ below a certain predetermined value, for a certain confidence
level $CL = 1 - \alpha$. Indeed, from Eq. (6.39) one gets:


$$
n = \frac{t^2_{1-\alpha/2}}{4\, (t_{1-\alpha/2}\, \sigma_{max})^2} . \tag{6.40}
$$


Exercise 6.5 
Find the number of samples needed to have an absolute interval less than 4
per thousand (that is, $\pm 0.002$) with a confidence level of 99%.


Answer Since large samples are now considered, we can use Table E.1, that
gives a quantile $t_{1-0.005} = 2.57$ for CL = 99% and a tabulated Gaussian area of
0.495. The requested interval is $\pm t_{1-\alpha/2}\, \sigma_{max} = \pm 0.002$. By inserting
these values in Eq. (6.40) one immediately obtains:


$$
n = \frac{(2.57)^2}{4 \cdot (0.002)^2} \simeq 412\,800 .
$$
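
Equation (6.40) is a one-liner in R. The function sample_size below is a hypothetical
helper written for this illustration; delta denotes the requested half-width
$t_{1-\alpha/2}\, \sigma_{max}$ of the interval.

```r
# Sample size from Eq. (6.40): n = t^2 / (4 * delta^2),
# where delta = t * sigma_max is the requested half-width of the interval.
# sample_size() is a hypothetical helper written for this example.
sample_size <- function(delta, CL, t = qnorm((1 + CL) / 2)) {
  t^2 / (4 * delta^2)
}

n_table <- sample_size(0.002, 0.99, t = 2.57)   # tabulated quantile used in the text
n_exact <- ceiling(sample_size(0.002, 0.99))    # more precise quantile from qnorm()
cat(n_table, n_exact, "\n")
```

The small difference between the two values comes only from the rounding of the
quantile to 2.57 in the table.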


Exercise 6.6 
In 20 independent Bernoulli trials, 5 events were recorded. What is the 
probability estimate of the event, with a CL of 90%? 


Answer We have now x = 5,n = 20, f = 5/20 = 0.25. We are therefore in 
the case of small samples. If we introduce the data x = 5,n = 20, CL = 0.90 
in the routine binom.test (5,20,conf=0.90) , the following values 
are obtained: 


$$
p_1 = 0.104 , \qquad p_2 = 0.455 .
$$


Therefore, according to Eq. (6.17), the interval estimate is:


$$
p \in [0.104, 0.455] = 0.25^{+0.21}_{-0.15} , \qquad CL = 90\% .
$$


We can also solve the problem in an approximate way, by applying Eq. (6.37)
with a value $t = 1.645$, deduced from the usual Table E.1 of the Gaussian
probabilities, as an intermediate value between the areas 0.4495 and 0.4505.
The approximation consists precisely in this last assumption on the Gaussian
levels of $t$, and not in the use of Eq. (6.37), which is general. From Eq. (6.37)
or from the R routine prop.test(5, 20, conf=0.90), one obtains the
interval:


$$
p \in [0.110, 0.458] = 0.25^{+0.21}_{-0.14} .
$$


In an even more approximate way, we can use Eq. (6.32) with the same $t$
value. The result is:


$$
p \in [0.09, 0.41] = 0.25 \pm 0.16 .
$$


As you can see, the three methods give slightly different results. This fact will 
be analysed in detail in Sect. 8.11, using simulation techniques. 
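
The three estimates of this exercise can be obtained side by side with the short
script below, given as a sketch with variable names of our own choosing.

```r
# Exercise 6.6 with the three methods (x = 5, n = 20, CL = 90%)
x <- 5; n <- 20; CL <- 0.90
f <- x / n
t <- qnorm((1 + CL) / 2)

exact  <- as.numeric(binom.test(x, n, conf.level = CL)$conf.int)  # Clopper-Pearson
wilson <- as.numeric(prop.test(x, n, conf.level = CL)$conf.int)   # Wilson, corrected
wald   <- f + c(-1, 1) * t * sqrt(f * (1 - f) / n)                # Eq. (6.32)

print(round(rbind(exact, wilson, wald), 3))
```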


6.8 Poissonian Interval Estimation 


The determination of confidence intervals can also be extended from the binomial
to the Poisson case. When a number $x$ of counts is observed, the mean $\mu$ can be
estimated, in analogy with Eqs. (6.18, 6.19):


$$
\sum_{k=x}^{\infty} \frac{\mu_1^k}{k!} \exp(-\mu_1) = c_1 , \qquad
\sum_{k=0}^{x} \frac{\mu_2^k}{k!} \exp(-\mu_2) = c_2 , \tag{6.41}
$$


where, for $x > 0$, the first equation is equivalent to:


$$
1 - \sum_{k=0}^{x-1} \frac{\mu_1^k}{k!} \exp(-\mu_1) = c_1 .
$$

In the symmetric case, one usually sets $c_1 = c_2 = (1 - CL)/2$. Here too, the
over-coverage of the interval, in agreement with Eq. (6.6), is guaranteed by the



presence, in both sums, of the measured value $x$. For these estimates we can use the
R routine poisson.test(x, conf, alt), where x is the number of observed
events and conf and alt, as usual, indicate the CL (with 0.95 as default value)
and the type of estimate ("two.sided" is the default value).

Under the Gaussian approximation, since for the Poissonian $\sigma^2 = \mu$, the
confidence interval for the expected value $\mu$ can be evaluated from the pivotal
quantity:


$$
\frac{|x - \mu|}{\sqrt{\mu}} \le |t_{\alpha/2}| , \tag{6.42}
$$


which is distributed according to the standard normal p.d.f. Also in this case, as
in Eq. (6.37), the values $|t_{\alpha/2}| = t_{1-\alpha/2}$ are the standard Gaussian quantiles at
the extremes of the interval with an area under the curve equal to $CL = 1 - \alpha$.
Introducing, as in Eq. (6.37), the continuity correction:


$$
x_\pm = x \pm 0.5 \quad \text{if } x \ne 0 , \tag{6.43}
$$


and solving Eq. (6.42) for $\mu$ one obtains:


$$
\mu \in x_\pm + \frac{t_{\alpha/2}^2}{2} \pm |t_{\alpha/2}| \sqrt{x_\pm + \frac{t_{\alpha/2}^2}{4}} . \tag{6.44}
$$

The knowledge gained from experience with simulated data, which we will
discuss later in Sect. 8.11, shows that this interval has excellent over-coverage
properties and can be used as an alternative to the correct interval (6.41) for $x > 10$
[Rot10]. When $x > 100$ Eq. (6.44) can be replaced by the asymptotic interval:


$$
\mu \in x \pm |t_{\alpha/2}| \sqrt{x} , \tag{6.45}
$$


that can be obtained directly from Eq. (6.42) with the error plug-in $\sigma \simeq \sqrt{x}$. To
better understand the practical use of these formulae, we recommend taking a look
at Problem 6.12.

The interval of Eq. (6.42) can be obtained with our routine PoissApp(x,
conf, alt), where the arguments have the usual meaning. The default value
of CL is conf=0.68, whereas alt="two".

In line with the scheme of Fig. 6.3, the first of Eq. (6.41), with $c_1 =
1 - CL$, also allows one to solve in a general and not approximate way the
problem of evaluating, for an assigned CL, the Poisson lower bound when
$x$ events have been obtained. The R command, given x observed events, is
poisson.test(x, alt="greater"). Instead, to find the upper bound with
an assigned CL, the second of Eq. (6.41) must be used, with $c_2 = 1 - CL$. The
R command is poisson.test(x, alt="less"). As in the previous case, if
a CL other than 0.95 is required, the argument conf = CL must be added.

Fig. 6.6 Graphical representation, for $x = 1$, of the second of Eqs. (6.41)

For $x = 0, 1, 2$, defining $\mu_2 \equiv \mu$, one has (see also Fig. 6.6):


$$
\begin{aligned}
e^{-\mu} &= 1 - CL , \\
e^{-\mu} + \mu\, e^{-\mu} &= 1 - CL , \\
e^{-\mu} + \mu\, e^{-\mu} + \frac{\mu^2}{2}\, e^{-\mu} &= 1 - CL ,
\end{aligned}
$$


and so on. Table 6.2 avoids the task of solving the above equations. For example,
the table shows that, when $\mu > 2.3$, on average no events will be observed in a
fraction of experiments <10%. Similarly, if 4.74 is the upper limit for $x = 1$ and
CL = 95%, this means that, when $\mu > 4.74$, the values $x = 0, 1$ can be obtained
in a fraction of experiments <5%, according to Fig. 6.6.

Instead of Table 6.2, the routine poisson.test may be used. For example,
when x = 2 the upper limits are:

poisson.test(2, conf=0.90, alt="less") = 5.322,
poisson.test(2, conf=0.95, alt="less") = 6.297.


Table 6.2 Poissonian upper limits $\mu_2$ of the mean number of events in correspondence
of x observed events, for 90% and 95% confidence levels

  x     90%     95%   |   x     90%     95%
  0    2.30    3.00   |   6   10.53   11.84
  1    3.89    4.74   |   7   11.77   13.15
  2    5.32    6.30   |   8   13.00   14.44
  3    6.68    7.75   |   9   14.21   15.71
  4    7.99    9.15   |  10   15.41   16.96
  5    9.27   10.51   |  11   16.61   18.21
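
The upper limits of Table 6.2 can be recomputed by solving the second of Eqs. (6.41)
numerically. The sketch below uses uniroot on the Poisson cumulative function; the
helper name upper_limit and the search interval are our own choices, and the last
line shows the equivalent call to poisson.test.

```r
# Poissonian upper limits mu_2: solve sum_{k=0}^{x} mu^k exp(-mu) / k! = 1 - CL
# upper_limit() is an illustrative helper, not one of the book's routines.
upper_limit <- function(x, CL) {
  uniroot(function(mu) ppois(x, mu) - (1 - CL),
          interval = c(1e-8, 100), tol = 1e-10)$root
}

x <- 0:11
tab <- data.frame(x   = x,
                  u90 = sapply(x, upper_limit, CL = 0.90),
                  u95 = sapply(x, upper_limit, CL = 0.95))
print(round(tab, 2))

# Same value from the built-in exact routine, here for x = 2 and CL = 90%
poisson.test(2, conf.level = 0.90, alternative = "less")$conf.int[2]
```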




Approximate upper and lower limits can be also determined from Eq. (6.44):


$$
\mu_1 = x_- + \frac{t_\alpha^2}{2} - |t_\alpha| \sqrt{x_- + \frac{t_\alpha^2}{4}} , \tag{6.46}
$$
$$
\mu_2 = x_+ + \frac{t_\alpha^2}{2} + |t_\alpha| \sqrt{x_+ + \frac{t_\alpha^2}{4}} , \tag{6.47}
$$


where $\alpha = 1 - CL$. The solutions of Eqs. (6.46, 6.47) are calculated by
PoissApp, where $x_- = x - 0.25$ and $x_+ = x + 0.5$. The choice of $x_-$ has been
empirically determined by comparison of the results with those of poisson.test
when $x < 10$.
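
For comparison, the two-sided Gaussian interval of Eq. (6.44) can be coded in a few
lines and set against the exact result. The helper pois_gauss_ci below is only a
sketch written for this illustration: it uses the continuity correction
$x_\pm = x \pm 0.5$ of Eq. (6.43) and is not the book's PoissApp routine, whose
internal choices may differ.

```r
# Approximate Poisson interval of Eq. (6.44) with the continuity correction (6.43).
# pois_gauss_ci() is an illustrative helper and is not the book's PoissApp routine.
pois_gauss_ci <- function(x, CL = 0.95) {
  stopifnot(x > 0)
  t <- qnorm((1 + CL) / 2)                       # |t_{alpha/2}|
  x_minus <- x - 0.5
  x_plus  <- x + 0.5
  c(lower = x_minus + t^2 / 2 - t * sqrt(x_minus + t^2 / 4),
    upper = x_plus  + t^2 / 2 + t * sqrt(x_plus  + t^2 / 4))
}

ci_approx <- pois_gauss_ci(23, CL = 0.95)
ci_exact  <- as.numeric(poisson.test(23, conf.level = 0.95)$conf.int)
print(rbind(approx = ci_approx, exact = ci_exact))
```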


Exercise 6.7 
In an experiment, 23 counts have been recorded. Find the upper limit at CL = 
0.95. 


Answer From Table E.1, it results that the quantile value corresponding to the
tail of area $\alpha = 0.500 - 0.450 = 0.050$ is $|t_\alpha| = 1.65$. Under the Gaussian
approximation, valid for $\mu > 10$, one can write, according to Fig. 6.7:


$$
\frac{\mu - 23}{\sqrt{\mu}} = 1.65 .
$$


This corresponds to the second degree equation in $\sqrt{\mu}$:


$$
\mu - 1.65 \sqrt{\mu} - 23 = 0 .
$$


The positive solution of this equation is $\sqrt{\mu} = 5.69$, therefore the required
value is $\mu = 32.37$.
The R routines give the result:


poisson.test(23, conf=0.95, alt="less") = 32.585

and, in an approximate way:

PoissApp(23, conf=0.95, alt="upp") = 32.410

Notice the approximate solution of Eq. (6.45):


$$
\mu = 23 + 1.65 \sqrt{23} = 30.9 ,
$$


which is quite different from the exact result in Gaussian approximation.




Fig. 6.7 Determination, using the method of Exercise 6.7, of the upper limit $\mu$ of a
Poissonian process with 23 recorded counts ($x = 23$, $\mu = 32.4$)


6.9 Mean Estimation from Large Samples 


The estimation of the mean and variance from a sample of any type of random variable is
a problem that does not have a general solution. However, there are fundamental
formulae that are valid for any variable when the sample is large (N > 100). On
the other hand, a general solution exists for Gaussian variables, as we will see shortly
in Sect. 6.11. Therefore, the estimation of mean and variance for small non-Gaussian
samples (N < 100) remains undefined. In this case, if the problem requires great
accuracy, Monte Carlo or bootstrap simulation techniques are used, as we will see
later in the sections dedicated to these topics. Now we deal with the estimation of
the mean of large samples in the case of generic variables.

Unlike the true mean $\mu$, which is a fixed quantity, the sample mean is a random
variable. In fact, if we produce a random sample of size $N$ from any distribution,
calculate the mean:


$$
m = \frac{1}{N} \sum_{i=1}^{N} x_i , \tag{6.48}
$$


and repeat the experiment many times, a different result is obtained for each sample.
Therefore, the sample mean is an estimator:


$$
M = \frac{1}{N} \sum_{i=1}^{N} X_i ,
$$


according to Eqs. (2.8, 2.71). Hence, if the variance operator is applied to the random
variable $M$, from Eqs. (4.19, 5.74) one obtains:


$$
\mathrm{Var}[M] = \frac{1}{N^2} \sum_{i=1}^{N} \mathrm{Var}[X_i] .
$$


6.9 Mean Estimation from Large Samples 223 


Since all the variables X; belong to the same sample and are independent, Var[ X;] = 
o”, where o? is the variance of the parent population. So we get the important and 
simple result: 


N 2 
1 2 1 7 6 
Therefore, we obtain the lo interval: 
pent anh (6.50) 


centred on the mean value and of half-width equal to the statistical error σ/√N. What is
the confidence level of this interval? The Central Limit Theorem 3.1 states that the
density of the sample mean is Gaussian for N ≫ 1, which in practice becomes N >
10. In the Gaussian case, the required confidence levels are given, from Eq. (6.9),
by the probability levels centred at the mean of the corresponding Gaussian density.
The problem is therefore completely solved, at least for large samples.

To explicitly indicate that, for large N, confidence levels are Gaussian, Eq. (6.50)
is often cast in the form:

\[ P\{X_1 + X_2 + \dots + X_N \le x\} \simeq \Phi\!\left(\frac{x - N\mu}{\sigma\sqrt{N}}\right) , \tag{6.51} \]

where μ and σ are the mean and the standard deviation of the N variables X_i and
Φ is the Gaussian cumulative function (3.43).
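A short simulation can illustrate Eqs. (6.49, 6.50) for a clearly non-Gaussian parent population. The sketch below is ours, not one of the book's routines: it assumes a uniform population on [0, 1] (μ = 0.5, σ² = 1/12), samples of size N = 12 and an arbitrary seed.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
N     <- 12                       # sample size
reps  <- 20000                    # number of simulated samples
mu    <- 0.5                      # true mean of the uniform(0,1) population
sigma <- sqrt(1 / 12)             # true standard deviation

# One sample mean per simulated experiment
means <- replicate(reps, mean(runif(N)))

sd_means <- sd(means)             # should be close to sigma/sqrt(N)
coverage <- mean(abs(means - mu) <= sigma / sqrt(N))   # 1-sigma coverage

print(c(sd_simulated = sd_means, sd_expected = sigma / sqrt(N),
        coverage_1sigma = coverage))
```

Under these assumptions the simulated spread of the sample mean should agree with σ/√N = 1/12, and the fraction of samples within one statistical error of μ should be close to the Gaussian level of about 68%.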

Since σ is usually unknown, let us now determine for which values of N it
is acceptable to replace it with s, the observed one, in Eq. (6.50). We represent
the sample discussed in the previous section for the probability estimation, with a
histogram where the values zero and one are assigned to the failure and the success,
respectively. Under the approximation N − 1 ≃ N, the mean and variance of this
sample, with x successes in N trials, are obtained from Eqs. (2.53, 2.55):

\[ m = \sum_i x_i f_i = 0\cdot\frac{N-x}{N} + 1\cdot\frac{x}{N} = \frac{x}{N} = f , \]

\[ s^2 = \sum_i (x_i - m)^2 f_i = (0 - m)^2\,\frac{N-x}{N} + (1 - m)^2\,\frac{x}{N}
       = f^2(1-f) + (1 + f^2 - 2f)\,f = f(1-f) . \]




This result shows that the histogram mean corresponds to the frequency f and the
histogram variance is f(1 − f). From Eq. (6.50), one immediately obtains:

\[ \mu \in m \pm \frac{s}{\sqrt{N}} = f \pm \sqrt{\frac{f(1-f)}{N}} . \tag{6.52} \]


This result is based on the substitution of the true variance with the measured one
and on the approximation for large samples N ≃ N − 1. Moreover, the frequency
given by Eq. (6.33) holds for N > 100. The conclusion is that, in Eq. (6.50), σ ≃ s
is a good approximation for N > 100. Therefore, we can write the mean estimate
for large samples as:

\[ \mu \in m \pm t_{1-\alpha/2}\,\frac{s}{\sqrt{N}} \qquad (N > 100,\ \text{Gaussian CL}) , \tag{6.53} \]

where 1 − CL = α and $t_{1-\alpha/2}$ is the positive Gaussian quantile.

The value m can be calculated with the R routine mean(x) for a set of raw data
contained in a vector x and with our routine MeanHisto(x,fre) for a histogram
with support x and frequencies fre. The value s² can be obtained, with the same
notations, from var(x) and VarHisto(x,fre).

The case of small Gaussian samples will be examined in Sect. 6.11.
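As an illustration of Eq. (6.53), the following sketch computes the large-sample interval for the mean with base R only. The function name mean_ci_large is hypothetical (it is not one of the book's routines), and the zero/one sample is an invented example that mimics the histogram used to derive Eq. (6.52).

```r
# Large-sample confidence interval for the mean, Eq. (6.53).
# Hypothetical helper, valid only for N > 100 (Gaussian confidence levels).
mean_ci_large <- function(x, conf = 0.95) {
  N <- length(x)
  m <- mean(x)
  s <- sd(x)                                  # uses the N - 1 denominator
  t <- qnorm(1 - (1 - conf) / 2)              # positive Gaussian quantile
  c(lower = m - t * s / sqrt(N), estimate = m, upper = m + t * s / sqrt(N))
}

# Invented zero/one sample: 100 successes in N = 400 trials (f = 0.25)
x  <- c(rep(0, 300), rep(1, 100))
ci <- mean_ci_large(x, conf = 0.95)
print(ci)

# Comparison with the frequency formula of Eq. (6.52), times the quantile
half_width_freq <- qnorm(0.975) * sqrt(0.25 * 0.75 / 400)
print(half_width_freq)
```

The two half-widths differ only through the N − 1 ≃ N approximation, which is the point made in the text for large samples.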


6.10 Variance Estimation from Large Samples 


In statistics we can define two types of variance: one with respect to the true mean,
the other with respect to the sample mean:

\[ S_\mu^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2 , \qquad
   S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(X_i - M)^2 . \tag{6.54} \]

These two quantities, for N → ∞, tend to the true variance σ² in the sense of
Eq. (2.77). In general, S² denotes the variance with respect to M, which is called
sample variance. In the following, we will distinguish the two variances with the
notation of Eq. (6.54) only if strictly necessary.




To start, let us account for the N − 1 factor present in Eq. (6.54). The reason lies
in the algebraic relation:

\[ \sum_i (X_i - \mu)^2 = \sum_i (X_i - M + M - \mu)^2 = \sum_i \left[(X_i - M) + (M - \mu)\right]^2 \]
\[ \qquad = \sum_i (X_i - M)^2 + \sum_i (M - \mu)^2 + 2(M - \mu)\sum_i (X_i - M) \]
\[ \qquad = \sum_i (X_i - M)^2 + N(M - \mu)^2 , \tag{6.55} \]

where in the last row the property $\sum_i (X_i - M) = 0$ has been used. Now take a
careful look at Eq. (6.55): it indicates that the dispersion of the data around the true
mean is equal to the dispersion of the data around the sample mean plus a term that
takes into account the dispersion of the sample mean around the true mean. This sum
of fluctuations is just the reason for the term N − 1. Indeed, by inverting Eq. (6.55),
one has:

\[ \sum_i (X_i - M)^2 = \sum_i (X_i - \mu)^2 - N(M - \mu)^2 . \tag{6.56} \]


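We now apply the mean operator to all members of this equation, according to the
technique discussed in Sect. 2.11. From Eqs. (2.62, 4.8, 6.49) one obtains:

\[ \Big\langle \sum_i (X_i - M)^2 \Big\rangle
   = \Big\langle \sum_i (X_i - \mu)^2 \Big\rangle - N\,\big\langle (M - \mu)^2 \big\rangle
   = \sum_i \sigma^2 - N\,\frac{\sigma^2}{N} = N\sigma^2 - \sigma^2 = (N-1)\,\sigma^2 . \tag{6.57} \]

Basically, this is the justification of the second of Eq. (6.54), because we see that the
estimator S² is unbiased since it satisfies the property (2.79):

\[ \langle S^2 \rangle = \Big\langle \frac{1}{N-1}\sum_i (X_i - M)^2 \Big\rangle = \sigma^2 . \tag{6.58} \]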


The sample variance:

\[ \frac{1}{N}\sum_i (X_i - M)^2 \tag{6.59} \]

is an example of a biased estimator. Indeed, from Eq. (6.57) it results:

\[ \Big\langle \frac{1}{N}\sum_i (X_i - M)^2 \Big\rangle = \frac{N-1}{N}\,\sigma^2 . \tag{6.60} \]


We recall from Sect. 2.11 that an estimator is biased when the mean of the estimators
T_N, calculated from samples of size N, differs from the limit for N → ∞ (2.77)
of the estimator. In this case, the mean differs from the value of the parameter under
estimation, that is, the variance, by a factor (N − 1)/N = 1 − 1/N. The term 1/N is
called distortion factor or bias; since it vanishes for N → ∞, this type of estimator
is defined as asymptotically correct, because for large N its mean is very close to
the limit in probability. All these aspects will be discussed in detail in Chap. 10.
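Equations (6.58) and (6.60) can be verified exactly on a toy case by enumerating all the possible samples instead of simulating them. The sketch below assumes a hypothetical four-valued population sampled with replacement, with samples of size N = 3; the names are ours.

```r
pop <- c(1, 2, 3, 4)              # hypothetical equiprobable population values
N   <- 3                          # sample size

# All 4^3 = 64 equally likely samples drawn with replacement
samples <- as.matrix(expand.grid(rep(list(pop), N)))

sigma2      <- mean((pop - mean(pop))^2)     # true variance of the population
s2_unbiased <- apply(samples, 1, var)        # denominator N - 1, Eq. (6.54)
s2_biased   <- s2_unbiased * (N - 1) / N     # denominator N,     Eq. (6.59)

print(c(true_variance = sigma2,
        mean_unbiased = mean(s2_unbiased),   # equals sigma2,           Eq. (6.58)
        mean_biased   = mean(s2_biased)))    # equals (N-1)/N * sigma2, Eq. (6.60)
```

Because every sample is enumerated, the averages are expectation values and not estimates, so the agreement with the two equations is exact up to rounding.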

Equation (6.58) can also be interpreted by recalling the concept of degrees of
freedom, first introduced in Sect. 3.8 for the χ² distribution. If you estimate the
mean from the experimental data, the sum of the N squares of the deviations is
actually made up of N − 1 independent terms, because the sample mean establishes
a link among the data.

The sample variance is therefore a unique quantity, provided that the sum of the
squares of the differences is divided by the degrees of freedom of the statistic. In a
rather general way, we then arrive at:


Definition 6.3 (Degrees of Freedom) The number of degrees of freedom ν of a
statistic T = t(X₁, X₂, ..., X_N, θ) which depends on a known set of parameters
θ is given by the number N of sample elements minus the number k of parameters
θ obtained from the data:

\[ \nu = N - k . \tag{6.61} \]


Let us now find the variance of the sample variance. This is not a paradox, because
the sample variance, like the mean, is a random variable, tending towards the true
variance for N → ∞. Similarly to what we did for the mean, we can apply
the variance operator to the quantities $S_\mu^2$ and S² of Eq. (6.54) and perform the
transformation (5.74). The operator formalism, defined in Eqs. (2.60, 2.67) and
already applied in Eq. (6.57), greatly simplifies this calculation, at least for $S_\mu^2$.
Indeed, we have:

\[ \mathrm{Var}[S_\mu^2] = \frac{1}{N^2}\sum_i \mathrm{Var}[(X_i - \mu)^2]
   = \frac{1}{N^2}\sum_i \left[ \big\langle (X_i - \mu)^4 \big\rangle - \big\langle (X_i - \mu)^2 \big\rangle^2 \right] \]
\[ \qquad = \frac{1}{N^2}\left[ N\Delta_4 - N\sigma^4 \right] = \frac{1}{N}\left( \Delta_4 - \sigma^4 \right) , \]

where the fourth-order moment has been indicated with the notation of Eq. (2.59).
Substituting the true values Δ₄ and σ² by the values estimated from the data, that
is, $s_\mu^2$ and

\[ D_4 = \frac{1}{N}\sum_i (x_i - \mu)^4 , \]

one obtains the result:

\[ \mathrm{Var}[S_\mu^2] \simeq \frac{D_4 - s_\mu^4}{N} . \tag{6.62} \]

The 1σ confidence interval for the variance is then given by:

\[ \sigma^2 \in s_\mu^2 \pm \sigma[S_\mu^2] = s_\mu^2 \pm \sqrt{\frac{D_4 - s_\mu^4}{N}} . \]


A similar calculation can be performed for the variance S², but in this case we must
also take into account the variance of the sample mean M around the true mean.
The final result, quite laborious to obtain, shows that, with respect to Eq. (6.62),
corrective terms of order 1/N² appear. Therefore, for large samples, they can be
neglected, and Eq. (6.62) can also be applied in this case, taking into account that now
the degrees of freedom are (N − 1) and not N. In the following we will use the
notation:

\[ \sigma^2 \in
   \begin{cases}
     s_\mu^2 \pm \sqrt{\dfrac{D_4 - s_\mu^4}{N}} & \text{known mean} , \\[2ex]
     s^2 \pm \sqrt{\dfrac{D_4 - s^4}{N-1}} & \text{unknown mean} .
   \end{cases} \tag{6.63} \]

What are the confidence levels of Eq. (6.63)? It can be shown that the sample
variance for any variable tends to the Gaussian density, but a good approximation is
only reached for samples with N > 100. If the sample elements are Gaussian, then,
as we will see better in the next section, the sampling distribution of the variance
is related to the χ² density, which converges to the Gaussian density a little faster,
roughly for N > 30. In general, the random variable "sample variance" tends to a
Gaussian distribution much more slowly than the corresponding sample mean. This
fact is justifiable if we observe that, in the variance, the combination of the sample
variables is quadratic rather than linear.

Once the confidence interval $[s_1^2, s_2^2]$ for the variance is determined, that of the
standard deviation can be found by defining $\sigma \in [\sqrt{s_1^2}, \sqrt{s_2^2}]$. The approximate
non-linear law (5.56) can be applied, where x = s² and z = s = √x = √s². Retaining
only the first term, one has:

\[ \mathrm{Var}[S] \simeq \frac{1}{4s^2}\,\mathrm{Var}[S^2] = \frac{1}{4s^2}\,\frac{D_4 - s^4}{N-1} . \tag{6.64} \]

The 1σ confidence interval for the standard deviation is then:

\[ \sigma \in s \pm \sqrt{\frac{D_4 - s^4}{4(N-1)\,s^2}} \qquad (N > 100,\ \text{Gaussian CL}) . \tag{6.65} \]
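The second of Eqs. (6.63) and Eq. (6.65) translate directly into code. The following is a minimal sketch with a hypothetical function name, not the book's VarEst routine; it assumes that, when the mean is unknown, the fourth moment D₄ is computed about the sample mean, and it uses a simulated Gaussian sample with an arbitrary seed.

```r
# Large-sample (N > 100) intervals for variance and standard deviation.
# Assumption: D4 is the fourth moment about the sample mean.
var_ci_large <- function(x, conf = 0.683) {
  N  <- length(x)
  s2 <- var(x)                               # sample variance, N - 1 denominator
  D4 <- mean((x - mean(x))^4)                # estimated fourth-order moment
  t  <- qnorm(1 - (1 - conf) / 2)            # Gaussian quantile
  list(s2       = s2,
       var_half = t * sqrt((D4 - s2^2) / (N - 1)),            # Eq. (6.63)
       s        = sqrt(s2),
       sd_half  = t * sqrt((D4 - s2^2) / (4 * (N - 1) * s2))) # Eq. (6.65)
}

set.seed(2)                                  # arbitrary seed
x   <- rnorm(500, mean = 70, sd = 10)        # simulated sample, true sigma = 10
res <- var_ci_large(x, conf = 0.95)
cat("variance: ", res$s2, "+/-", res$var_half, "\n")
cat("std. dev.:", res$s,  "+/-", res$sd_half,  "\n")
```

By construction the half-width for the standard deviation is the half-width for the variance divided by 2s, which is the content of the error propagation in Eq. (6.64).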


Our routine VarEst(x,fre,conf,alt) estimates the variance and the
standard deviation of a data sample with the second of Eqs. (6.63) and (6.65),
respectively. As usual, x is a raw data vector if fre is missing, whereas it is the
histogram bin value if the frequency vector fre is given; conf is the CL and alt
is the estimate type ("two", "low", "upp"). The upper and lower limits for
an assigned CL are calculated with the formulae:

\[ \sigma_u^2 = s^2 + t_{1-\alpha}\sqrt{\frac{D_4 - s^4}{N-1}} , \qquad
   \sigma_u = s + t_{1-\alpha}\sqrt{\frac{D_4 - s^4}{4(N-1)\,s^2}} , \tag{6.66} \]

\[ \sigma_l^2 = s^2 - t_{1-\alpha}\sqrt{\frac{D_4 - s^4}{N-1}} , \qquad
   \sigma_l = s - t_{1-\alpha}\sqrt{\frac{D_4 - s^4}{4(N-1)\,s^2}} , \tag{6.67} \]

where α = 1 − CL and $t_{1-\alpha}$ is the Gaussian quantile.
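If the sample elements are Gaussian, then relation (3.33), i.e. Δ₄ = 3σ⁴, holds,
and Eqs. (6.63, 6.65) become respectively:

\[ \sigma^2 \in s^2 \pm \sigma^2\sqrt{\frac{2}{N-1}} \simeq s^2 \pm s^2\sqrt{\frac{2}{N-1}} , \tag{6.68} \]

\[ \sigma \in s \pm \sigma\sqrt{\frac{1}{2(N-1)}} \simeq s \pm s\sqrt{\frac{1}{2(N-1)}} , \tag{6.69} \]

from which, for CL = 1 − α:

\[ \frac{s^2}{1 + t_{1-\alpha/2}\sqrt{2/(N-1)}} \le \sigma^2 \le \frac{s^2}{1 - t_{1-\alpha/2}\sqrt{2/(N-1)}} , \tag{6.70} \]

\[ \frac{s}{1 + t_{1-\alpha/2}\sqrt{1/[2(N-1)]}} \le \sigma \le \frac{s}{1 - t_{1-\alpha/2}\sqrt{1/[2(N-1)]}} , \tag{6.71} \]

where $t_{1-\alpha/2}$ are the Gaussian quantiles. The value N − 1 must be replaced by N
when the dispersions are calculated with respect to the true mean.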




Equations (6.53, 6.63, 6.65, 6.70, 6.71) solve the problem of estimating 
mean, variance and standard deviation for large samples. Sometimes, instead of 
Eqs. (6.70, 6.71), the right sides of Eqs. (6.68, 6.69) are used, where o is replaced 
by s in the statistical error formula. 

We will shortly give some examples of statistical estimates, after describing the 
estimates from Gaussian samples. 


6.11 Mean and Variance Estimation for Gaussian Samples 


The estimation of mean and variance for small samples (N < 100) gives intervals
and confidence levels that depend on the type of the parent population of the sample.
For this reason, unlike what we have just discussed, it is not possible to obtain
formulae of general validity. However, by applying the elements of probability
theory so far developed, it is possible to get a simple and complete solution, at least
for the most frequent and important case, that of Gaussian samples. The approach is
based on two fundamental points. The first point is that, as shown in Exercise 5.3,
the sample mean M is also Gaussian for any N, since it is the sum of N Gaussian
variables. Then, due to Eq. (6.49), also the variable:

\[ \frac{M - \mu}{\sigma/\sqrt{N}} \]


is a standard Gaussian variable for any N, that is, a pivotal quantity. The second 
point can be summarized in two important theorems: 


Theorem 6.1 (Independence of M and S²) If (X₁, X₂, ..., X_N) is a random
sample of size N coming from a Gaussian population g(x; μ, σ), M and S² are
independent random variables.


Proof The sample can be considered as an N-dimensional Gaussian vector belonging
to the space of Gaussian samples. Therefore, we can consider the subspace
formed by the N sample means M of N vectors (samples) belonging to the space:
P(M)X = (M, M, ..., M). Since this is also a Gaussian vector, we can build the
orthogonal subspace with the third of Eq. (4.75):

\[ P(M^\perp)X = [I - P(M)]X = (X_1 - M,\ X_2 - M,\ \dots,\ X_N - M) . \]

Indeed, vectors belonging to these two subspaces have a null scalar product
(4.74), as one can easily verify. From Cochran's Theorem 4.5, it results that
M and $|P(M^\perp)X|^2 = \sum_i (X_i - M)^2$ are independent. Since (N − 1)S² =
$|P(M^\perp)X|^2$ ~ χ²(N − 1) the theorem is proved. □




Theorem 6.2 (Sample Variance) If (X₁, X₂, ..., X_N) is a random sample coming
from a Gaussian population g(x; μ, σ), the variable:

\[ Q_R = \frac{S^2}{\sigma^2} = \frac{1}{N-1}\sum_i \frac{(X_i - M)^2}{\sigma^2} \tag{6.72} \]

follows the reduced χ² density (3.72) with N − 1 degrees of freedom. Therefore, it
is a pivotal quantity with respect to σ².


Proof From the previous theorem, we deduce that (N − 1)S² = $|P(M^\perp)X|^2$: hence,
we can apply Cochran's Theorem 4.5 to the vector (X_i − M)/σ, (i = 1, 2, ..., N),
so that the theorem is proved.

Notice that if the terms of Eq. (6.56) are rearranged and divided by σ², one
obtains:

\[ \underbrace{\sum_i \frac{(X_i - \mu)^2}{\sigma^2}}_{N}
   = \underbrace{\sum_i \frac{(X_i - M)^2}{\sigma^2}}_{N-1}
   + \underbrace{\frac{(M - \mu)^2}{\sigma^2/N}}_{1} \qquad \text{(degrees of freedom)} \]

according to the additivity Theorem 3.4 for the χ² variables. □
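Theorems 6.1 and 6.2 can be made plausible with a simulation. The sketch below uses invented parameters (N = 5, μ = 10, σ = 2) and an arbitrary seed; it checks that M and S² are uncorrelated and that Q_R = S²/σ² has the mean 1 and the variance 2/(N − 1) expected for a reduced χ² variable. A vanishing correlation is of course only a necessary condition for independence, not a proof of it.

```r
set.seed(3)                                   # arbitrary seed
N <- 5; mu <- 10; sigma <- 2                  # invented Gaussian parameters
reps <- 20000                                 # number of simulated samples

# One Gaussian sample of size N per row
x  <- matrix(rnorm(N * reps, mean = mu, sd = sigma), nrow = reps)
M  <- rowMeans(x)                             # sample means
S2 <- apply(x, 1, var)                        # sample variances (N - 1)
QR <- S2 / sigma^2                            # pivotal quantity of Eq. (6.72)

rho <- cor(M, S2)                             # expected to be close to 0
print(c(correlation = rho,
        mean_QR = mean(QR),                   # expected: 1
        var_QR  = var(QR),                    # expected: 2/(N - 1)
        var_QR_expected = 2 / (N - 1)))
```

Repeating the same exercise with a skewed parent population (for instance rexp in place of rnorm) would show a clearly non-zero correlation, which is why the theorems are stated for Gaussian samples only.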


The confidence interval for the mean can then be determined using the results of
Exercise 5.5. Indeed, the variable (5.41), which in this case is:

\[ T = \frac{M - \mu}{\sigma}\,\sqrt{N}\,\frac{1}{\sqrt{Q_R}}
     = \frac{M - \mu}{\sigma}\,\sqrt{N}\,\frac{\sigma}{S}
     = \frac{M - \mu}{S}\,\sqrt{N} , \tag{6.73} \]

turns out to be the ratio between a standard Gaussian variable and the square root
of a reduced χ² variable. Since these two variables are independent of each other,
this ratio follows the Student's distribution with N − 1 degrees of freedom. The
Student's quantiles are tabulated in Table E.2 (notice that N appears in Eq. (6.73),
but the variable T has N − 1 degrees of freedom).

Equation (6.73) shows that the Student's variable provides the definition of a
pivotal quantity for the mean μ without using the true variance (usually unknown)
σ² for the determination of the confidence interval. Indeed, if $t_{1-\alpha/2} = -t_{\alpha/2}$ are
the T quantile values corresponding to the fixed confidence level CL = 1 − α as
shown in Fig. 6.8, inverting Eq. (6.73) one obtains:

\[ m - t_{1-\alpha/2}\,\frac{s}{\sqrt{N}} \le \mu \le m + t_{1-\alpha/2}\,\frac{s}{\sqrt{N}} . \tag{6.74} \]

As it is easy to deduce from Table E.2 and Fig. 5.4, for N > 100 the Student's
density is practically identical to a Gaussian, and Eq. (6.74) coincides with
Eq. (6.53). Recall that generally, in the case of a standard Gaussian variable, t is
not explicitly indicated or is denoted by $z_\alpha$.
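Equation (6.74) needs only the Student's quantile, which in R is given by the standard function qt. The helper below is a hypothetical sketch (it is not the book's MeanEst routine) working from the summary values m, s and N; the numerical inputs are invented for illustration.

```r
# Student's interval for the mean of a Gaussian sample, Eq. (6.74).
# Hypothetical helper working from summary statistics.
t_interval <- function(m, s, N, conf = 0.95) {
  t <- qt(1 - (1 - conf) / 2, df = N - 1)     # Student's quantile, N - 1 d.o.f.
  c(lower = m - t * s / sqrt(N), upper = m + t * s / sqrt(N))
}

ci10  <- t_interval(m = 10, s = 5, N = 10)    # small sample
ci100 <- t_interval(m = 10, s = 5, N = 100)   # larger sample
print(rbind(N10 = ci10, N100 = ci100))

# The Student's quantiles approach the Gaussian one as N grows
print(c(t_9 = qt(0.975, 9), t_99 = qt(0.975, 99),
        t_999 = qt(0.975, 999), gauss = qnorm(0.975)))
```

The last line makes the remark about N > 100 quantitative: with 99 degrees of freedom the Student's quantile exceeds the Gaussian one by little more than 1%.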




Fig. 6.8 Confidence intervals corresponding to the $t_{\alpha/2}$, $t_{1-\alpha/2}$ quantile values
of the Student's variable (a) and of the reduced chi-squared variable (b) for
the estimates of mean and variance from small Gaussian samples


We postpone the exercises on the use of these formulae to first deal with the
estimation of variance. In this case Eq. (6.72) gives the pivotal quantity S²/σ² ~
$\chi^2_R(N-1)$. Following the procedure of Eqs. (6.11, 6.12), we start by defining the
probability interval:

\[ \chi^2_{R(\alpha/2)} \le \frac{S^2}{\sigma^2} \le \chi^2_{R(1-\alpha/2)} , \tag{6.75} \]

where $\chi^2_{R(\alpha/2)}$, $\chi^2_{R(1-\alpha/2)}$ are the quantile values of Q_R corresponding to the
requested confidence level CL, as shown in Fig. 6.8. Therefore, by inverting the
interval (6.75), we obtain the interval for the variance estimation corresponding to
the measured value s²:

\[ \frac{s^2}{\chi^2_{R(1-\alpha/2)}} \le \sigma^2 \le \frac{s^2}{\chi^2_{R(\alpha/2)}} , \qquad N-1 \text{ degrees of freedom} . \tag{6.76} \]


The probability α is connected to the confidence level CL through Eq. (6.14). The
upper and lower limits for α = 1 − CL are given by:

\[ \sigma^2 \le \frac{s^2}{\chi^2_{R(\alpha)}} , \qquad \sigma^2 \ge \frac{s^2}{\chi^2_{R(1-\alpha)}} , \tag{6.77} \]

and the confidence interval for the standard deviation is simply given by:

\[ \frac{s}{\sqrt{\chi^2_{R(1-\alpha/2)}}} \le \sigma \le \frac{s}{\sqrt{\chi^2_{R(\alpha/2)}}} . \tag{6.78} \]
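The intervals (6.76) and (6.78) can be sketched in R with the standard function qchisq, dividing its quantiles by the degrees of freedom to obtain the reduced χ² quantiles. The function name chisq_interval and the input values are ours, not the book's VarEst routine.

```r
# Two-sided intervals for variance and standard deviation of a
# Gaussian sample, Eqs. (6.76, 6.78). Hypothetical helper.
chisq_interval <- function(s2, N, conf = 0.95) {
  alpha <- 1 - conf
  df    <- N - 1
  q_lo  <- qchisq(alpha / 2, df) / df         # reduced quantile chi2_R(alpha/2)
  q_hi  <- qchisq(1 - alpha / 2, df) / df     # reduced quantile chi2_R(1-alpha/2)
  var_int <- c(lower = s2 / q_hi, upper = s2 / q_lo)
  list(var = var_int, sd = sqrt(var_int))
}

r <- chisq_interval(s2 = 25, N = 10, conf = 0.95)
print(r$var)    # interval for the variance
print(r$sd)     # interval for the standard deviation
```

Note that the larger quantile gives the lower limit and vice versa, because the quantiles appear in the denominators; the resulting interval is strongly asymmetric around s² for small N.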




For values N > 30, the χ² density is close to the symmetric Gaussian shape, centred
around the average ⟨Q_R⟩ = 1 and with standard deviation:

\[ \sigma_\nu \equiv \sigma[Q_R] = \sqrt{\frac{2}{N-1}} , \]

in agreement with Eq. (3.71). In this case the quantile values can be written as:

\[ \chi^2_{R(\alpha/2)} = 1 - t_{1-\alpha/2}\,\sigma_\nu , \qquad \chi^2_{R(1-\alpha/2)} = 1 + t_{1-\alpha/2}\,\sigma_\nu , \]

and the confidence interval with CL = 1 − α becomes:

\[ \frac{s^2}{1 + t_{1-\alpha/2}\,\sigma_\nu} \le \sigma^2 \le \frac{s^2}{1 - t_{1-\alpha/2}\,\sigma_\nu} , \tag{6.79} \]

which is again the interval (6.70).

Our routine MeanEst(x,fre,conf,alt), where the arguments have
already been described, estimates the mean with Eq. (6.74) using the Student's
quantiles. They become practically identical to the Gaussian ones of Eq. (6.53) for
N > 100. Our routine VarEst(x,fre,conf,alt), already described above,
estimates the variance using Eqs. (6.76, 6.77).

These routines allow a useful comparison between the formulae for large samples
and those for Gaussian samples.


6.12 How to Use the Estimation Theory 


In the previous sections, we have deduced the fundamental formulae for the 
parameter estimation. They are those commonly used in the analysis of the data that 
are usually collected in many different scientific fields, from physics to engineering 
and biology. The overall picture is summarized in Table 6.3, which shows that the 
formulae derived above solve the estimation problem in a simple and general way 
for the case of large samples. It is also evident that the estimation of the mean and of 
the dispersion for small non-Gaussian samples remains not well defined. However, 
this is a case that occurs quite rarely in practice and for which it is not possible to 
give a general solution, because both the intervals and the confidence levels depend 
on the specific distribution involved in the problem. As we have already mentioned, 
in these cases simulation techniques are often used with success. 

Table 6.3 1σ confidence intervals and corresponding distributions for the determination of the
confidence levels (CL) for parameter estimation. Samples with N > 100 are usually considered
as large samples. The symbol ? indicates the lack of a general solution

Probability
  Nf < 10                  Equations (6.18, 6.19)                  CL: Binomial
  Nf > 10, N(1 − f) > 10   f ± sqrt(f(1 − f)/N)                    CL: Gauss

             | Gaussian variables                            | Any variables
             | Interval                         | CL         | Interval                              | CL
Mean
  N < 100    | m ± s/sqrt(N)                    | Student    | ?                                     | ?
  N > 100    | m ± s/sqrt(N)                    | Gauss      | m ± s/sqrt(N)                         | Gauss
Variance
  N < 100    | s²/χ²_R ≤ σ² ≤ s²/χ²_R           | χ²_R       | ?                                     | ?
  N > 100    | Equation (6.70)                  | Gauss      | s² ± sqrt((D₄ − s⁴)/(N − 1))          | Gauss
Std. Dev.
  N < 100    | s/sqrt(χ²_R) ≤ σ ≤ s/sqrt(χ²_R)  | χ²_R       | ?                                     | ?
  N > 100    | Equation (6.71)                  | Gauss      | s ± sqrt((D₄ − s⁴)/(4s²(N − 1)))      | Gauss

From Table 6.3 it also results that, for large samples, all the variances of the
estimators for frequency, mean, variance, etc. are of the form σ₀²/N, where σ₀ is
a constant. Therefore, the Kolmogorov condition (2.76), sufficient for the almost
certain convergence of all these estimates towards the true values, is fully satisfied.
Weak convergence is therefore also verified, as can be directly seen from the
Tchebychev inequality. Indeed, Eq. (3.93) can be written also in the form:

\[ P\{|X - \mu| \ge K\sigma\} \le \frac{1}{K^2} \quad\Longrightarrow\quad
   P\{|X - \mu| \ge \varepsilon\} \le \frac{\sigma^2}{\varepsilon^2} , \]

where K = ε/σ. If T ≡ T_N is an estimator given in Table 6.3, Var[T_N] = σ₀²/N
and we obtain:

\[ P\{|T_N - \theta| \ge \varepsilon\} \le \frac{\sigma_0^2}{N\varepsilon^2} . \]

Therefore, Eq. (2.73) is verified for N → ∞.




Exercise 6.8 

A Gaussian sample with N elements has mean m = 10 and standard deviation
s = 5. Estimate, with a confidence level of 95%, mean, variance and standard
deviation for N = 10 and N = 100.


Answer Since the sample is Gaussian, we can use Eqs. (6.74, 6.76, 6.78),
which require the determination of the quantiles of the Student's and χ²
distributions for α = 0.025 and 0.975. These distributions are tabulated in
Appendix E.

Since the Student's density is symmetric, from Table E.2 it results:

\[ t_{0.975} = -t_{0.025} =
   \begin{cases}
     2.26 & \text{for } N = 10 \text{ (9 degrees of freedom)} \\
     1.98 & \text{for } N = 100 \text{ (99 degrees of freedom)}
   \end{cases} \]

From Eq. (6.74) we then obtain the mean estimate:

\[ \mu \in 10 \pm 2.26\,\frac{5}{\sqrt{10}} = 10.0 \pm 3.6 \qquad (N = 10,\ CL = 95\%) , \]
\[ \mu \in 10 \pm 1.98\,\frac{5}{\sqrt{100}} = 10.0 \pm 1.0 \qquad (N = 100,\ CL = 95\%) . \tag{6.80} \]

For the dispersion, we must use the quantiles of the $\chi^2_R$ density. From
Table E.3 one obtains:

\[ \chi^2_{R\,0.025},\ \chi^2_{R\,0.975} =
   \begin{cases}
     0.30,\ 2.11 & \text{for } N = 10 \text{ (9 degrees of freedom)} \\
     0.74,\ 1.29 & \text{for } N = 100 \text{ (99 degrees of freedom).}
   \end{cases} \]

From Eq. (6.76) and its square root, the estimates of the variance and of the
standard deviation are obtained:

\[ \sigma^2 \in \left[\frac{25}{2.11},\ \frac{25}{0.30}\right] = [11.9,\ 83.3] \qquad (N = 10,\ CL = 95\%) , \]
\[ \sigma^2 \in \left[\frac{25}{1.29},\ \frac{25}{0.74}\right] = [19.3,\ 33.8] \qquad (N = 100,\ CL = 95\%) , \tag{6.81} \]

\[ \sigma \in [\sqrt{11.9},\ \sqrt{83.3}] = [3.4,\ 9.1] \qquad (N = 10,\ CL = 95\%) , \]
\[ \sigma \in [\sqrt{19.3},\ \sqrt{33.8}] = [4.4,\ 5.8] \qquad (N = 100,\ CL = 95\%) . \tag{6.82} \]


As a useful comparison, we apply the approximate formulae, assuming
Gaussian confidence levels. From Exercise 3.8, or directly from Table E.1,
we obtain t = 1.96 when CL = 0.95.

For the mean estimate we can use Eq. (6.53):

\[ \mu \in 10 \pm 1.96\,\frac{5}{\sqrt{10}} = 10.0 \pm 3.1 \qquad (N = 10) , \]
\[ \mu \in 10 \pm 1.96\,\frac{5}{\sqrt{100}} = 10.0 \pm 1.0 \qquad (N = 100) , \]

which gives (within rounding) a result identical to Eq. (6.80) for N = 100,
and a slightly underestimated result for N = 10. In fact, as can be seen from
Fig. 5.4, the Student's density tails subtend areas slightly larger than those
subtended by the standard Gaussian.

To roughly estimate the dispersion parameters with CL = 95%, we use
Eqs. (6.70, 6.71) again with t = 1.96:

\[ \sigma^2 \in \left[\frac{25}{1 + 1.96\sqrt{2/9}},\ \frac{25}{1 - 1.96\sqrt{2/9}}\right] = [13.0,\ 328.7] \qquad (N = 10) , \]
\[ \sigma^2 \in \left[\frac{25}{1 + 1.96\sqrt{2/99}},\ \frac{25}{1 - 1.96\sqrt{2/99}}\right] = [19.6,\ 34.7] \qquad (N = 100) , \]

\[ \sigma \in [3.6,\ 18.1] \qquad (N = 10) , \]
\[ \sigma \in [4.4,\ 5.9] \qquad (N = 100) . \]

If we compare these results with the correct ones of Eqs. (6.81, 6.82), we
notice that the approximate dispersions are only acceptable for N = 100.
It is also possible to use the right sides of Eqs. (6.68, 6.69) with s ≃ σ,
obtaining the intervals σ² ∈ [1.9, 48.1], σ ∈ [2.7, 7.3] for N = 10 and
σ² ∈ [18.1, 31.9], σ ∈ [4.3, 5.7] for N = 100. As you can see, for large
samples, Eqs. (6.68, 6.69) can also be used. The problem can also be solved
with our routines MeanEst and VarEst.
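The approximate dispersion intervals of this exercise are easy to reproduce. The sketch below is a hypothetical helper, not the book's VarEst routine: it evaluates the interval (6.70) and the symmetric interval given by the right side of Eq. (6.68), with the rounded quantile t = 1.96 used in the text.

```r
# Approximate (Gaussian CL) intervals for the variance of a Gaussian sample
approx_var <- function(s2, N, t = 1.96) {
  sv <- sqrt(2 / (N - 1))                     # sigma_nu, as in Eq. (6.79)
  rbind(ratio     = s2 / (1 + c(1, -1) * t * sv),   # Eq. (6.70)
        symmetric = s2 * (1 + c(-1, 1) * t * sv))   # right side of Eq. (6.68)
}

v10  <- approx_var(25, N = 10)
v100 <- approx_var(25, N = 100)
print(v10);  print(sqrt(v10["ratio", ]))      # variance, then std. dev. limits
print(v100); print(sqrt(v100["ratio", ]))
```

For N = 10 the upper limit of the interval (6.70) grows very large because the denominator 1 − 1.96 √(2/9) is close to zero, which is one way of seeing why the Gaussian approximation is not acceptable for such a small sample.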




Exercise 6.9 

The analysis of a sample of 1000 electrical resistances (resistors) of
1000 Ω has shown that the values are approximately distributed according
to a Gaussian with standard deviation s = 10 Ω (actually the production
processes of dough resistors well verify the conditions of the Central Limit
Theorem 3.1). To keep this production standard constant in time, a quality
control was planned by periodically measuring a sample of five resistors with
a highly accurate multimeter. Define the statistical limits of the quality control
at a 95% confidence level.


Answer We have to assume the nominal value of the resistors as the true
average value of production: μ = 1000 Ω.

The true dispersion of the electrical resistance values around the mean can
be estimated from the data obtained from the sample of 1000 resistors by
applying Eq. (6.69) with s ≃ σ:

\[ \sigma \in s \pm \frac{s}{\sqrt{2(N-1)}} = 10.0 \pm 0.2\ \Omega , \]

which shows that the sample of 1000 resistors gives an estimate of the
dispersion with a relative uncertainty of 2%. Therefore, we can assume the
value s = 10 Ω as the true value of the standard deviation.

Since the observed sample is Gaussian, from Table E.1 we can say that the
interval 1000 ± 1.96σ ≃ 1000 ± 20 Ω contains 95% of all values. Basically,
only 5 resistors out of 100 will fall outside the interval:

\[ 980\ \Omega \le R \le 1020\ \Omega . \tag{6.83} \]

The problem is now to establish controls on the produced resistors to verify
that these initial conditions remain reasonably constant.
By randomly selecting five resistors, we can set up an adequate quality
control using the sample mean. In fact, assuming as true values:

\[ \mu = 1000\ \Omega , \qquad \sigma = 10\ \Omega , \]

from Eq. (6.50), we obtain that the sample mean of five elements will be
contained in the interval:

\[ \mu \pm \frac{\sigma}{\sqrt{5}} = 1000.0 \pm 4.5\ \Omega , \tag{6.84} \]



with nearly Gaussian probability levels. Student's distribution should not be
used here, because the true standard deviation value is assumed to be known
from the 1000 resistor measurement. If, on the contrary, we had to deduce it
from the small five-resistor sample, Eq. (6.84) would still hold, but in this
case, instead of σ, we would have to use the standard deviation s of the five
resistors, and the confidence levels would follow the Student's distribution
with 4 degrees of freedom, since N < 10.

For a first quality control, Eq. (6.84) can be used, with a 95% confidence
level, which, from Table E.1, is associated to an interval of about 1.96σ. In
this case, the probability of error by judging as poor a good resistor is 5%.
A first quality check will then indicate a possible bad production when the
sample mean is outside the range:

\[ 1000 \pm 1.96\cdot\frac{10}{\sqrt{5}} \simeq (1000 \pm 9)\ \Omega , \]

that is:

\[ 991\ \Omega \le m(R) \le 1009\ \Omega \qquad \text{(first quality control, CL = 95\%)} . \tag{6.85} \]

The global quality check can be further refined by also verifying that the
sample standard deviation does not exceed the value of 10 Ω. Indeed, by
inverting Eq. (6.78) and taking its maximum, one has:

\[ s \le \sqrt{\chi^2_R}\;\sigma , \tag{6.86} \]

where σ = 10 Ω and $\chi^2_R$ is the value of Q_R(4), that is, the reduced χ²
variable with 4 degrees of freedom, corresponding to the required confidence
level. For CL = 95%, from Table E.3 one gets:

\[ \chi^2_{R\,0.95}(4) = 2.37 \]

and hence:

\[ s \le 10\cdot\sqrt{2.37} \simeq 15\ \Omega . \]

The second quality check will then report one possible bad production when
the standard deviation of the sample with five resistors exceeds the limit of
15 Ω:

\[ s(R) \le 15\ \Omega \qquad \text{(second quality control, CL = 95\%)} . \]



If these two quality controls are required to be satisfied at the same time,
it is ensured that both the average and the initial dispersion of the electrical
resistances are correctly kept within the arbitrarily chosen confidence level.

With a single common CL = 0.95, we will have at least one of these limits
exceeded with probability 1 − 0.95² = 0.0975, that is, in about 10% of checks.

A signal outside the confidence band, but within the limits of the expected
statistics, is called a false alarm. The situation is often summarized graphically
in the quality control chart, in which a zone of normality (or control zone) is
chosen; above or below these limits there are two alarm bands and outside
of these the forbidden zone. Figure 6.9 shows a possible control chart for
the resistor mean value of our problem. The control zone corresponds to the
interval (6.85) of width ±1.96σ; the alarm zone is from 1.96σ to 3.0σ, while
the forbidden zone is outside 3.0σ. A value in the forbidden zone can occur
under normal conditions only 3 times out of 1000 controls (3σ law), an event
that can justify the production suspension and the activation of the machine
maintenance processes (warning: this is a subjective decision that can vary
from case to case).

As an exercise, with Eq. (6.86) you can also draw a similar chart (S chart)
for the dispersion of the data.

In the alarm zone, on average, we should have 5 values for every hundred
checks, corresponding to an a priori probability γ = 0.05. The quality control
can then be further refined by detecting if an excessive number of false alarms
occur, that is, if there are too many alarms compared to the number of alarms
expected when the production quality remains stable. If n is the number of
false alarms in N checks, Eq. (6.29) can be applied, since here we assume to
know the true probability γ:

\[ t_n = \frac{n - \gamma N}{\sqrt{N\gamma(1-\gamma)}} = \frac{n - 0.05\,N}{0.218\,\sqrt{N}} . \tag{6.87} \]

Therefore, the production should be suspended, on the basis of the 3σ law, when
t_n > 3.0. For a Gaussian variable, this value corresponds to a probability of
about 1.5 per thousand to wrongly stop a good production. The Gaussian
approximation is valid for n > 10, which corresponds to Nγ > 10, that is,
N > 200 in Eq. (6.87). For N = 200, it turns out that it is reasonable to
proceed with maintenance if n > 19.
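The limits used in this exercise can be recomputed in a few lines of R. This is a sketch with variable names of our choosing; it uses the rounded quantile 1.96 of the text and the standard function qchisq for the reduced χ² quantile.

```r
mu <- 1000; sigma <- 10            # assumed true values, in ohm
n_sample <- 5                      # resistors measured in each check

# Control zone (1.96 sigma) and forbidden zone (3 sigma) for the sample mean
se_mean      <- sigma / sqrt(n_sample)
mean_limits  <- mu + c(-1, 1) * 1.96 * se_mean     # Eq. (6.85)
forbid_limit <- mu + c(-1, 1) * 3 * se_mean

# Upper limit for the sample standard deviation, Eq. (6.86)
chi2R   <- qchisq(0.95, df = n_sample - 1) / (n_sample - 1)
s_limit <- sigma * sqrt(chi2R)

# Probability that at least one of the two 95% checks fails
p_any <- 1 - 0.95^2

# False alarms: threshold on n for N_checks checks, from t_n = 3 in Eq. (6.87)
gamma <- 0.05; N_checks <- 200
n_stop <- gamma * N_checks + 3 * sqrt(N_checks * gamma * (1 - gamma))

print(c(mean_limits, forbid_limit))
print(c(chi2R = chi2R, s_limit = s_limit, p_any = p_any, n_stop = n_stop))
```

The threshold n_stop is slightly above 19, in agreement with the rule n > 19 given above; the 3σ limits come out close to the values 986.5 and 1013.4 that mark the alarm bands in Fig. 6.9.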




Fig. 6.9 Control chart for the resistor production, as discussed in Exercise 6.9 (levels marked
on the chart: 986.5, 991, 1000, 1009 and 1013.4; the two ALARM bands lie between 991 and 986.5
and between 1009 and 1013.4)


6.13 Estimates from a Finite Population 


In the estimates described so far, we have assumed that the population was made 
up of an infinite set of elements. The results obtained are also valid for finite 
populations, provided that, after each draw, the extracted element is replaced into 
the population (sampling with replacement). However, it is intuitive that there are 
some changes to be made in the case of sampling without replacement from a finite 
population, since, if the population were used up, the quantities of interest would 
become certain and would no longer be statistical estimates. We then begin with the 


Definition 6.4 (Random Sample from a Finite Population) A sample S of N
elements drawn from a finite population of N_p units is said to be random if it
represents one of the N_p!/[N!(N_p − N)!] possible sets, each of which has an equal
chance of being chosen.


If X is the random variable contained in the sample S, we can write:

\[ M = \frac{1}{N}\sum_{i=1}^{N} X_i = \frac{1}{N}\sum_{i=1}^{N_p} x_i I_i , \tag{6.88} \]

where the second sum is over all the N_p units of the population and I_i is a dummy
variable (see Eq. 2.7), defined as:¹

\[ I_i = \begin{cases} 1 & \text{if } x_i \in S , \\ 0 & \text{otherwise.} \end{cases} \]

¹ In order not to overload the notation, we write x_i ∈ S to indicate that the i-th population unit has
been extracted.




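I_i is a two-valued binomial variable corresponding to the number of possible
successes of a single trial having a probability N/N_p. Therefore, we have:

\[ \langle I_i \rangle = \frac{N}{N_p} , \qquad \mathrm{Var}[I_i] = \frac{N}{N_p}\left(1 - \frac{N}{N_p}\right) \quad \forall i . \tag{6.89} \]

The obtained result:

\[ \langle M \rangle = \frac{1}{N}\sum_{i=1}^{N_p} x_i \langle I_i \rangle = \frac{1}{N_p}\sum_{i=1}^{N_p} x_i = \langle X \rangle , \tag{6.90} \]

shows that also in this case the sample mean is a correct estimator of the true mean.
To find the variance of the sum of the sampled variables, we can write, using
Eq. (5.65):

\[ \mathrm{Var}\Big[\sum_{i=1}^{N_p} x_i I_i\Big] = \sum_{i=1}^{N_p} x_i^2\,\mathrm{Var}[I_i] + 2\sum_i \sum_{j<i} x_i x_j\,\mathrm{Cov}[I_i, I_j] . \tag{6.91} \]

The covariance estimation (I_i, I_j) requires the knowledge of the mean ⟨I_i I_j⟩.
Having in mind the general definitions of Sect. 2.8 and only considering the non-zero
values, one has:

\[ \langle I_i I_j \rangle = P\{x_i \in S,\ x_j \in S\} = P\{x_j \in S \,|\, x_i \in S\}\,P\{x_i \in S\} = \frac{N-1}{N_p - 1}\,\frac{N}{N_p} . \]

Using Eq. (4.25), we can write the covariance as:

\[ \mathrm{Cov}[I_i, I_j] = \langle I_i I_j \rangle - \langle I_i \rangle \langle I_j \rangle
   = \frac{N}{N_p}\,\frac{N-1}{N_p - 1} - \frac{N^2}{N_p^2} = -\frac{N(N_p - N)}{N_p^2\,(N_p - 1)} . \]

Inserting this result into Eq. (6.91), one has:

\[ \mathrm{Var}\Big[\sum_{i=1}^{N_p} x_i I_i\Big] = \frac{N(N_p - N)}{N_p - 1}\left( \langle X^2 \rangle - \langle X \rangle^2 \right) , \tag{6.93} \]

where $\langle X^2 \rangle = \sum_i x_i^2 / N_p$ and $\langle X \rangle^2 = (\sum_i x_i)^2 / N_p^2$. Therefore, the variance of the
sample mean results:

\[ \mathrm{Var}[M] = \frac{1}{N^2}\,\mathrm{Var}\Big[\sum_{i=1}^{N_p} x_i I_i\Big]
   = \frac{1}{N}\,\frac{N_p - N}{N_p - 1}\left( \langle X^2 \rangle - \langle X \rangle^2 \right)
   = \frac{\mathrm{Var}[X]}{N}\,\frac{N_p - N}{N_p - 1} \simeq \frac{\mathrm{Var}[X]}{N}\left(1 - \frac{N}{N_p}\right) . \tag{6.94} \]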


This equation represents the fundamental result for the estimates from finite
populations: the comparison with the analogous formulae for infinite populations
(see Table 6.3) shows that the variances of means, frequencies and proportions
calculated from samples extracted from finite populations must be corrected with
the factor $(N_p - N)/(N_p - 1) \simeq (1 - N/N_p)$. The same factor must be applied if in
Eq. (6.94) the true variance $\sigma^2$ is replaced by the estimated one $s^2$. For example, in
the case of Exercise 6.4, considering 30 million voters, the correction to the error
would be very small, of the order of $\sqrt{1 - 3/30\,000}$. The situation is different for an
extraction without replacement from an urn: if the frequency of marbles of a certain
type were, for example, $f = 15/30 = 0.50$ and the urn contained 100 marbles, the
frequency error would go from $\sqrt{0.5(1 - 0.5)/30} = 0.09$ (infinite population) to
the value $\sqrt{0.5(1 - 0.5)/30}\,\sqrt{70/99} = 0.08$. The variance would vanish if all 100
marbles were drawn.
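
The urn numbers quoted above can be reproduced with a few lines of R. This is only a sketch of the arithmetic; the variable names are illustrative and do not belong to any routine of the book.

```r
# Finite-population correction for the error on a frequency (urn example)
f  <- 15 / 30    # observed frequency
N  <- 30         # sample size
Np <- 100        # population size (marbles in the urn)

s_inf <- sqrt(f * (1 - f) / N)        # error for an infinite population
fpc   <- sqrt((Np - N) / (Np - 1))    # correction factor on the standard deviation
s_fin <- s_inf * fpc                  # error for the finite population

round(c(infinite = s_inf, finite = s_fin), 2)

# Voter example: N/Np = 3/30000, so the correction is negligible
sqrt(1 - 3 / 30000)
```

The correction acts on the variance through $(N_p - N)/(N_p - 1)$, so its square root multiplies the standard deviation.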

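The sample variance, unlike the mean, must be corrected for finite populations.
In fact, from Eqs. (6.57, 6.94) one has:

$$ \langle S^2 \rangle = \frac{1}{N-1}\Big\langle \sum_i (X_i - M)^2 \Big\rangle = \frac{1}{N-1}\left[N\sigma^2 - \frac{N_p - N}{N_p - 1}\,\sigma^2\right] = \sigma^2\,\frac{N_p}{N_p - 1}\,. \qquad (6.95) $$

It turns out then that the unbiased estimator for the variance of a finite population
is:

$$ \frac{N_p - 1}{N_p}\,S^2 = \frac{N_p - 1}{N_p}\,\frac{1}{N-1}\sum_{i=1}^{N} (X_i - M)^2\,. \qquad (6.96) $$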


The derivation of the correction for the variance $\mathrm{Var}[S^2]$ is more complicated and
can be found in [KS73].
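
Equations (6.94) to (6.96) can be checked by simulation. The sketch below assumes an arbitrary fixed population of 50 values (drawn once from a Gaussian, purely for illustration) and repeatedly extracts samples of 20 elements without replacement; population size, sample size and number of repetitions are illustrative choices.

```r
set.seed(1)
Np  <- 50                                # population size (illustrative)
N   <- 20                                # sample size
pop <- rnorm(Np, mean = 70, sd = 10)     # one fixed finite population
sigma2 <- mean(pop^2) - mean(pop)^2      # population variance <X^2> - <X>^2

nrep <- 20000
m  <- numeric(nrep)
s2 <- numeric(nrep)
for (k in seq_len(nrep)) {
  x     <- sample(pop, N)                # extraction without replacement
  m[k]  <- mean(x)
  s2[k] <- var(x)                        # usual 1/(N-1) sample variance
}

# Eq. (6.94): variance of the sample mean with the finite-population factor
var_m_theory <- sigma2 / N * (Np - N) / (Np - 1)
c(simulated = var(m), theory = var_m_theory)

# Eq. (6.95): <S^2> = sigma^2 Np/(Np-1); the factor of Eq. (6.96) removes the bias
c(mean_S2 = mean(s2), corrected = mean(s2) * (Np - 1) / Np, sigma2 = sigma2)
```

Within the statistical fluctuations of the simulation, the simulated variance of the mean agrees with Eq. (6.94) and the corrected mean of the sample variances agrees with the population variance.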


6.14 Histogram Analysis 


After the analysis of sample mean and variance, we now move to the analysis of the
overall sample structure (shape), with the aim of obtaining information on its parent
population. Indeed, sample mean and variance are not the only random variables
of interest. Usually the sample is presented in the form of a histogram, subdivided
into $K$ bins, almost always with fixed width $\Delta x$, each containing a number
$n_i$ of events. The quantities $n_i$ give the overall shape of the sample and are of
crucial importance if one is interested in studying the density structure of the parent
population. These quantities should be considered as values of random variables $I_i$,
because they vary from sampling to sampling. If the histogram is normalized, instead of
$n_i$, the measured frequencies or probabilities $f_i = n_i/N$ are given in each bin,
where $N$ is the total number of events in the sample. As an example, in Fig. 6.10
and Eq. (6.97) a non-normalized histogram is shown as obtained from a
computer-simulated Gaussian sample of $N = 1000$ events coming from a parent population of
true parameters $\mu = 70$, $\sigma = 10$. In Eq. (6.97) $x_i$ and $n_i = n(x_i)$ are the midpoint
and the content of a bin, respectively. The bin width is $\Delta x = 5$.


    x     n(x)        x     n(x)
  37.5      1       72.5    207
  42.5      4       77.5    153
  47.5     16       82.5    101
  52.5     44       87.5     42
  57.5     81       92.5      7
  62.5    152       97.5      6
  67.5    186
                                      (6.97)


The R routine hist(x) draws the histogram of a raw data set contained in the
vector x. Without any user input, the bin width and the graphic style of the histogram
are automatically set by the routine. If you have a vector x containing the abscissas
of the bins and a vector fre containing the frequencies or the number of events of
each bin, you can use our HistoBar(x, fre) routine, which draws the histogram
as in Fig. 6.10 (top).
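
As a rough illustration with base R only (HistoBar is not reproduced here), the binned data of Eq. (6.97) can be drawn with barplot(), and raw data can be binned with hist() using explicit bin edges. The seed and the bin edges below are arbitrary choices, and the plot is sent to a null device so that the sketch also runs non-interactively.

```r
pdf(NULL)   # null graphics device; drop this line to see the plot on screen

# Bin midpoints and contents of Eq. (6.97)
x   <- seq(37.5, 97.5, by = 5)
fre <- c(1, 4, 16, 44, 81, 152, 186, 207, 153, 101, 42, 7, 6)

sum(fre)    # total number of events in the histogram
barplot(fre, names.arg = x, space = 0, xlab = "x", ylab = "n(x)")

# Starting from raw data, hist() with explicit edges returns the bin contents
set.seed(10)
raw <- rnorm(1000, mean = 70, sd = 10)
h   <- hist(raw, breaks = seq(0, 140, by = 5), plot = FALSE)
h$counts[h$counts > 0]

invisible(dev.off())
```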

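If $p(x)$ is the p.d.f. of the population, by defining the random variables $I_i$ and $F_i$
related to the events $\{I_i = n_i\}$ and $\{F_i = n_i/N\}$, it follows, from Eq. (2.33), that:

$$ \langle I_i \rangle = \mu_i = N p_i = N \int_{\Delta x} p(x)\,dx \simeq N p(x_0)\,\Delta x\,, \qquad (6.98) $$

$$ \langle F_i \rangle = p_i = \int_{\Delta x} p(x)\,dx \simeq p(x_0)\,\Delta x\,, \qquad (6.99) $$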


where $\Delta x$ is the bin width and $x_0$ is a generic point in the bin. The rightmost term in
the equations follows from the integral mean value theorem. If the bin width is small
enough and the density is a fairly smooth function, one can assume that it varies
linearly within the bin width; under these conditions, according to the trapezoidal
rule, $x_0$ is the bin midpoint.

Fig. 6.10 Two possible representations of a histogram obtained with a computer
simulation of 1000 events from a Gaussian population with $\mu = 70$ and $\sigma = 10$

In the case of discrete random variables, the integral over $\Delta x$ in Eqs. (6.98, 6.99)
must be replaced by the sum of the true probabilities of the values contained in $\Delta x$.

In Sect. 4.7, we have seen that the global probability of having a specific
experimental histogram of a random sample of size $N$, given the true probabilities
$p_i$ $(i = 1, 2, \ldots, K)$ obtained from a p.d.f. $p(x)$ using Eq. (6.99), follows the
multinomial distribution (4.89). We already noted, commenting on Eq. (4.89), that
the number $I_i$ of events falling in the i-th bin $(x_i, x_{i+1})$ of width $\Delta x$ follows the
binomial distribution. Indeed:


• If the random process is stationary in time, the probability $p_i$ to fall in the i-th
  bin remains constant.

• The probability to fall in a bin does not depend on the events previously recorded
  or that will be recorded in other bins.


Therefore, we can state that the random variable $I_i$ (number of events in a
histogram bin) is given by the occurrence of independent events with a constant
probability. If the total number of events $N$ is a predetermined parameter, then the
probability for the random variable $I_i$ to take the value $n_i$ will be given by the
binomial law (2.29) with elementary probability $p_i$ (see also Eq. (4.89)):

$$ P\{I_i = n_i\} = b(n_i; N, p_i) = \frac{N!}{n_i!\,(N - n_i)!}\,p_i^{n_i}\,(1 - p_i)^{N - n_i}\,, \qquad (6.100) $$



where $(1 - p_i)$ is the probability to fall into any histogram bin different from the
i-th one. The standard deviation is:

$$ \sigma_i = \sqrt{N p_i (1 - p_i)}\,. \qquad (6.101) $$

This quantity can be estimated from the data through the uncertainty $s_i = s(n_i)$. If
the bin contains more than 20-30 events, Eq. (6.34) can be used, with $x = n_i$ and
$n = N$:

$$ s_i = \sqrt{n_i \left(1 - \frac{n_i}{N}\right)}\,. \qquad (6.102) $$

This approximation is often used even for bin contents above five events.
If the histogram is normalized, Eq. (6.102) must be divided by $N$, thus obtaining
a well-known result, the statistical error (6.33) on the frequency $f_i = n_i/N$:

$$ s(f_i) = \frac{s_i}{N} = \sqrt{\frac{f_i (1 - f_i)}{N}}\,. \qquad (6.103) $$

The two previous formulae are of fundamental importance in the analysis of 
histograms and are called random (or statistical) fluctuations of the bin contents. 
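
As a small worked sketch, Eqs. (6.102) and (6.103) can be applied to the bin contents of Eq. (6.97). The data frame layout and the variable names are illustrative.

```r
# Bin contents of Eq. (6.97), total N fixed
x   <- seq(37.5, 97.5, by = 5)
fre <- c(1, 4, 16, 44, 81, 152, 186, 207, 153, 101, 42, 7, 6)
N   <- sum(fre)
f   <- fre / N

s_n <- sqrt(fre * (1 - fre / N))   # Eq. (6.102): error on the bin content
s_f <- sqrt(f * (1 - f) / N)       # Eq. (6.103): error on the frequency

data.frame(x = x, n = fre, s_n = round(s_n, 1), f = f, s_f = round(s_f, 4))
```

The bins with only a few events are listed for completeness; for those the Gaussian approximation behind these formulae is not reliable.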

If the histogram is not obtained with a fixed total number $N$ of events but is
collected considering other parameters, for example, a certain time interval $\Delta t$, the
number $N$ turns from a constant into a Poissonian random variable, and the
fluctuations of the bin contents must be calculated in a different way.

We want to explain this rather subtle point with an example. If we look at multiple
histograms, each of which refers to the weight of 100 newborns, the fluctuations in
the number of babies within a certain weight range (or percentile, as doctors say)
will obey Eqs. (6.102, 6.103). If, on the other hand, we collect the histograms of the
newborn weights monthly, the fluctuations in the number of babies within a certain
weight range will be superimposed on those of the total number $N$ of babies in a month,
which will be Poissonian with a stable average value (if we assume, to simplify, that the
births are stable from month to month). To treat this case correctly, the following
theorem is essential:


Theorem 6.3 (On the Binomial and Poissonian Variables) Let X be the number 
of successes in N trials. When N is not a fixed parameter but a Poissonian random 
variable, X follows the Poisson density. 


Proof From the compound probability law, the probability to observe $\{I_i = n_i\}$
events in the i-th bin over a total of $N$ events will be given by the product of the
Poissonian probability (3.14) to observe a total of $\{\mathbf{N} = N\}$ events, when the mean
is $\lambda$, times the binomial probability (2.29) to get $n_i$ events in the considered bin,
over a total of $N$, when the true probability is $p_i$ (here $\mathbf{N}$ denotes the random
variable and $N$ one of its values):

$$ P\{I_i = n_i, \mathbf{N} = N\} = P\{I_i = n_i | \mathbf{N} = N\}\,P\{\mathbf{N} = N\} $$

$$ = p(n_i, N) = \frac{N!}{n_i!\,(N - n_i)!}\,p_i^{n_i}\,(1 - p_i)^{N - n_i}\,\frac{e^{-\lambda}\lambda^{N}}{N!}\,. $$

If one now defines $m_i = N - n_i$ and uses the identities:

$$ e^{-\lambda} = e^{-\lambda p_i}\,e^{-\lambda(1 - p_i)}\,, \qquad \lambda^{N} = \lambda^{N - n_i}\lambda^{n_i} = \lambda^{m_i}\lambda^{n_i}\,, $$

the probabilities can be written in the form:

$$ p(n_i, m_i) = \frac{e^{-\lambda p_i}\,(\lambda p_i)^{n_i}}{n_i!}\;\frac{e^{-\lambda(1 - p_i)}\,[\lambda(1 - p_i)]^{m_i}}{m_i!}\,, $$

which is the product of two Poissonians, of means $\lambda p_i$ and $\lambda(1 - p_i)$, respectively.
From this equation and from Theorem 4.1, one can deduce that the number $I_i$ of
events in the i-th channel and the number $(\mathbf{N} - I_i)$ of the events contained in the
other bins are independent Poissonian variables. In other words, if $\mathbf{N}$ is a
Poissonian variable and $I_i$, for fixed $\{\mathbf{N} = N\}$, is a binomial variable, then $I_i$ and
$(\mathbf{N} - I_i)$ are independent Poissonian variables. □
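
Theorem 6.3 lends itself to a quick numerical check. In the sketch below the values of the Poissonian mean (100), of the bin probability (0.3) and of the number of repetitions are arbitrary illustrative choices; the total is drawn from a Poisson distribution and the bin content from a binomial with that total.

```r
set.seed(2)
lambda <- 100      # mean of the Poissonian total (illustrative)
p      <- 0.3      # probability of the considered bin (illustrative)
nrep   <- 50000

Ntot <- rpois(nrep, lambda)                  # Poissonian total number of events
I    <- rbinom(nrep, size = Ntot, prob = p)  # bin content, binomial for fixed total
rest <- Ntot - I                             # events falling in the other bins

# For a Poissonian variable mean and variance coincide
c(mean_I = mean(I), var_I = var(I), expected = lambda * p)
c(mean_rest = mean(rest), var_rest = var(rest), expected = lambda * (1 - p))

# Independence implies a vanishing correlation between I and (Ntot - I)
cor(I, rest)
```

With a fixed total the same two quantities would be completely anticorrelated; here the correlation is compatible with zero.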


Since we know that the standard deviation of the Poissonian is equal to the square
root of the mean, we can immediately change Eq. (6.102) into the form:

$$ \sigma[I_i] = \sigma_i = \sqrt{\lambda p_i} \simeq s(n_i) = \sqrt{n_i}\,, $$

where the true values have been replaced by the measured ones. The statistical
uncertainty of the bin content is then given by:

$$ s(n_i) = \sqrt{n_i}\,, \qquad s\!\left(\frac{n_i}{N}\right) = \frac{\sqrt{n_i}}{N} = \sqrt{\frac{f_i}{N}}\,. \qquad (6.104) $$

Let us now summarize these results in a coherent scheme. The estimate of the true
number of events $\mu_i$ (mathematical expectation or expected value) in the i-th bin is given
by Eq. (6.98), and the corresponding approximate 68.3% level confidence interval
is given by:

$$ \mu_i \in n_i \pm \sqrt{n_i \left(1 - \frac{n_i}{N}\right)} \qquad (6.105) $$

for histograms with a fixed total number of events $N$. For histograms where $N$ is a
Poissonian variable, one has instead:

$$ \mu_i \in n_i \pm \sqrt{n_i}\,. \qquad (6.106) $$




For normalized histograms, Eqs. (6.105) and (6.106) transform respectively into:

$$ p_i \in f_i \pm \sqrt{\frac{f_i (1 - f_i)}{N}}\,, \qquad (6.107) $$

$$ p_i \in f_i \pm \sqrt{\frac{f_i}{N}}\,, \qquad (6.108) $$

which are the estimates of the true quantities (6.99).

These formulae are valid for $n_i > 5$ to $10$, that is, for bins containing at least about
10 events. In this case the Gaussian confidence levels hold.

For bins with fewer than ten events, Eq. (6.107) should be replaced by Eq. (6.31)
(with $N$ instead of $n$), and Eq. (6.105) should be replaced by Eq. (6.31) multiplied
by $N$. The Poissonian formulae (6.106, 6.108) remain unchanged, but, in these
cases, the confidence levels are not Gaussian and must be directly obtained from the
binomial and Poisson distributions, depending on whether $N$ is fixed or variable.

All the previous conclusions also give a satisfying intuitive representation of the
bin content fluctuations. For example, if we consider a two-channel histogram, the
numbers of events $n_1$ and $n_2$ in these two channels are completely correlated when
$N$ is constant, since $n_1 + n_2 = N$. In effect, we are dealing with a single random
variable, and in this case the statistical errors of the two channels are equal, as is
evident from Eq. (6.103), which is symmetric in $f$ and $(1 - f)$, or from Eq. (6.102),
after a little bit of algebra. In general, a fixed $N$ determines a correlation between
channels, given by the covariance (4.92), which statistically, for $n_i, n_j > 10$, can
be estimated with a good approximation as:

$$ s(n_i, n_j) = -N f_i f_j\,. \qquad (6.109) $$


On the other hand, when N is a Poissonian variable, any histogram bin behaves as 
an independent Poissonian event counter with fluctuations equal to the square root 
of the number of events. 
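
The two regimes can be compared with a short simulation. This is a sketch under illustrative assumptions: a three-bin histogram with true probabilities 0.2, 0.3 and 0.5, a total of 200 events (fixed in one case, Poissonian with mean 200 in the other) and 20000 repetitions.

```r
set.seed(3)
p    <- c(0.2, 0.3, 0.5)   # true bin probabilities (illustrative)
N    <- 200                # fixed total, or mean of the Poissonian total
nrep <- 20000

# Fixed N: bin contents are multinomial and therefore correlated
fixed     <- rmultinom(nrep, size = N, prob = p)   # matrix with one column per histogram
cov_fixed <- cov(fixed[1, ], fixed[2, ])
c(simulated = cov_fixed, expected = -N * p[1] * p[2])

# Poissonian N: the total is drawn first, then shared among the bins
Ntot     <- rpois(nrep, N)
pois     <- sapply(Ntot, function(n) as.vector(rmultinom(1, size = n, prob = p)))
cov_pois <- cov(pois[1, ], pois[2, ])
c(simulated = cov_pois, expected = 0)
```

In the first case the covariance reproduces the value $-N p_i p_j$ estimated by Eq. (6.109); in the second it is compatible with zero, as expected for independent counters.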

The graphical representation of the histogram, to be complete, must then include
statistical errors. By convention, these errors are evaluated using Eqs. (6.106, 6.108),
neglecting possible correlation between channels, and are plotted as $\pm s_i$ bars
centred on the $n_i$ values. These intervals, called error bars, define a band that
should contain the true values $\mu_i$ or $p_i$ of Eqs. (6.98, 6.99). However, as always, we
must remember that confidence levels, if $n_i > 5$ to $10$, follow the $3\sigma$ law. Therefore,
the total band containing these values with a reasonable certainty is actually three
times larger than the error bars shown in the graphs. The histogram of Fig. 6.10,
completed with the error bars from Eq. (6.102), is shown in Fig. 6.11. This
representation can be obtained with our routine HistoBar(x, fre, errors="ON").
In the case of a normalized histogram, the vector fre contains the frequencies,
and the error request must be completed by the number of events, in order
to apply Eq. (6.108). For example, in the case $N = 100$, the call should be
HistoBar(x, fre, errors="ON", nev=100).
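
For readers without the HistoBar routine, a comparable plot can be sketched with base R graphics only. The code below is not the authors' implementation: it draws the bin contents of Eq. (6.97) as points with Poissonian error bars from Eq. (6.106), on a null device so that it also runs non-interactively.

```r
pdf(NULL)   # null graphics device; drop this line to see the plot on screen

x   <- seq(37.5, 97.5, by = 5)
fre <- c(1, 4, 16, 44, 81, 152, 186, 207, 153, 101, 42, 7, 6)
s   <- sqrt(fre)                       # Eq. (6.106): Poissonian error on each bin

plot(x, fre, pch = 16, ylim = c(0, max(fre + s)),
     xlab = "x", ylab = "n(x)")
# Vertical bars from n - s to n + s, with small horizontal caps
arrows(x, fre - s, x, fre + s, angle = 90, code = 3, length = 0.03)

invisible(dev.off())
```

Replacing s with sqrt(fre * (1 - fre / sum(fre))) gives the binomial bars of Eq. (6.102) for a fixed total.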


Fig. 6.11 Histogram of Fig. 6.10 with error bars


Fig. 6.12 Measured and expected values from a population density model, with the statistical
fluctuation of the bin content (labels in the figure: binomial or Poissonian fluctuation, density
function, expected value). The shaded area should be imagined as projected orthogonally to the
sheet


We have shown that the fluctuations in the number of events contained in a
certain histogram bin follow the binomial or Poissonian probability. This is a
completely general rule, independent of the density $p(x)$ describing the sample
parent population, which can be arbitrary. As shown in Fig. 6.12, this density instead
determines the overall structure of the sample, which is described by the mean or
central values of the bin contents.




On this subject, an important hypothesis testing topic is how to check whether
the shape of the sample is or is not in agreement with a density model chosen for the
population. We will discuss this issue in Sect. 7.5.


6.15 Estimation of the Correlation 


The sample correlation coefficient obtained from a finite set of data $(x_i, y_i)$
generally has a non-zero value, even when the variables are uncorrelated. It is
therefore necessary to verify if the sample correlation coefficient $r$ evaluated from
the data is compatible or not with a null value (hypothesis test) or to estimate
the confidence interval within which the true correlation coefficient $\rho$ is located
(parameter estimation). The sample estimate of the correlation coefficient (4.31)
for a finite set of $N$ elements requires the preliminary definition of the sample
covariance $S(X, Y) = S_{XY}$.

Without loss of generality, we can consider a pair of centred variables (with zero
true mean), for which the true covariance is:

$$ \mathrm{Cov}[X, Y] = \langle (X - \mu_x)(Y - \mu_y) \rangle = \langle XY \rangle\,. \qquad (6.110) $$


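To identify possible biases it is necessary, with a procedure similar to that of
Eq. (6.57), to find the true mean value:

$$ \Big\langle \sum_{i=1}^{N} (X_i - M_x)(Y_i - M_y) \Big\rangle\,, $$

where, in general, $M_x \to m_x \ne 0$, $M_y \to m_y \ne 0$ even when the true values of
the means are zero. By applying the linearity properties of the mean operator and
noting that:

$$ \sum_i M_x Y_i = M_x \sum_i Y_i = \frac{1}{N} \sum_j X_j \sum_i Y_i\,, $$

and so on, one has:

$$ \Big\langle \sum_{i=1}^{N} (X_i - M_x)(Y_i - M_y) \Big\rangle = \Big\langle \sum_i \left(X_i Y_i - M_x Y_i - M_y X_i + M_x M_y\right) \Big\rangle $$

$$ = N \langle XY \rangle - \frac{1}{N} \Big\langle \sum_i X_i \sum_j Y_j \Big\rangle\,. \qquad (6.111) $$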
i 




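The last term of this equation can be rearranged as:

$$ \Big\langle \sum_i X_i \sum_j Y_j \Big\rangle = \Big\langle \sum_i X_i Y_i \Big\rangle + \sum_{i \ne j} \langle X_i Y_j \rangle\,. $$

Now, $X_i$ and $Y_j$ (with $i \ne j$) are independent, because they come from different events
sampled independently (remember that the correlation exists for the pair $(X_i, Y_i)$,
observed in the same event!). From Eq. (4.9) one can then write:

$$ \langle X_i Y_j \rangle = \langle X \rangle \langle Y \rangle = 0\,, $$

since, by assumption, the true means are zero. Since $\langle \sum_i X_i Y_i \rangle = N \langle XY \rangle$, Eq. (6.111)
becomes:

$$ \Big\langle \sum_{i=1}^{N} (X_i - M_x)(Y_i - M_y) \Big\rangle = \Big\langle \sum_i X_i Y_i \Big\rangle - \frac{1}{N} \Big\langle \sum_i X_i Y_i \Big\rangle = (N - 1) \langle XY \rangle\,. \qquad (6.112) $$

This result, recalling Eq. (6.110), implies:

$$ \mathrm{Cov}[X, Y] = \Big\langle \frac{1}{N-1} \sum_{i=1}^{N} (X_i - M_x)(Y_i - M_y) \Big\rangle\,. \qquad (6.113) $$

In conclusion, the unbiased sample covariance is:

$$ s(x, y) = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - m_x)(y_i - m_y)\,. \qquad (6.114) $$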


If, instead of raw data, we have a two-dimensional histogram $n_{ij}$ containing the
number of pairs $(x_i, y_j)$, the covariance is evaluated as:

$$ s_{xy} = \frac{1}{N-1} \sum_{ij} (x_i - m_x)(y_j - m_y)\,n_{ij}\,. \qquad (6.115) $$


The R routine cov(x, y) can be used to calculate $s(x, y)$ from a set of raw data,
while our routine CovarHisto(x, y, mat) performs the same operation for data
presented in histograms, where mat is the matrix $n_{ij}$ of Eq. (6.115).
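
Equation (6.115) can be sketched in a few lines of base R. The function cov_hist below is a hypothetical helper written for this illustration (it is not the CovarHisto routine), and the small matrix is invented test data; the result is compared with cov() applied to the equivalent raw pairs.

```r
# Covariance from a two-dimensional histogram, Eq. (6.115).
# x: bin centres along the rows of mat; y: bin centres along the columns.
cov_hist <- function(x, y, mat) {
  N  <- sum(mat)
  mx <- sum(x * rowSums(mat)) / N      # mean of the x marginal histogram
  my <- sum(y * colSums(mat)) / N      # mean of the y marginal histogram
  sum(outer(x - mx, y - my) * mat) / (N - 1)
}

# Invented example: 3 x 2 histogram
x   <- c(1, 2, 3)
y   <- c(10, 20)
mat <- matrix(c(2, 1,
                1, 3,
                0, 2), nrow = 3, byrow = TRUE)

# Expand the histogram into raw pairs and compare with cov()
idx   <- which(mat > 0, arr.ind = TRUE)
reps  <- mat[idx]
raw_x <- rep(x[idx[, "row"]], times = reps)
raw_y <- rep(y[idx[, "col"]], times = reps)

c(histogram = cov_hist(x, y, mat), raw = cov(raw_x, raw_y))
```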




To estimate the variance of the sample covariance, we can proceed as in the case
of the sample variance:

$$ \mathrm{Var}[s(x, y)] = \frac{1}{(N-1)^2} \sum_i \mathrm{Var}[(x_i - m_x)(y_i - m_y)] \qquad (6.116) $$

$$ \simeq \frac{N}{(N-1)^2} \left[ \big\langle (x_i - m_x)^2 (y_i - m_y)^2 \big\rangle - \big\langle (x_i - m_x)(y_i - m_y) \big\rangle^2 \right]\,, $$

where the variance properties (2.67) and the additivity formula (5.74) have been
used, since the different terms in the sum of $s(x, y)$ have null covariance. The
last equality in Eq. (6.116) is approximate, because the exact equation should
contain the true means. This inaccuracy can be corrected with some additional
terms, discussed in [KS73], which are of the order of $1/N$ with respect to Eq. (6.116)
and, therefore, are generally negligible. We have found that these terms change the
error given by (6.116) by 10-15% and only for small samples having about ten
events. If x and y are two experimental data sets, the R coding of the last term of
the equation, before the multiplication by $N/(N-1)^2$, is the following:


> meanx = mean(x)
> meany = mean(y)
> mean(((x-meanx)*(y-meany))^2) - (mean((x-meanx)*(y-meany)))^2


This coding is used in our routines CorrelEst and CovarTest. 
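
To show how the pieces fit together, the sketch below wraps Eqs. (6.114) and (6.116) in a small function. The name cov_with_error and the simulated data are assumptions made for this illustration; the function is not the code of CorrelEst or CovarTest.

```r
# Sample covariance, Eq. (6.114), with its approximate standard error, Eq. (6.116)
cov_with_error <- function(x, y) {
  N    <- length(x)
  d    <- (x - mean(x)) * (y - mean(y))
  s_xy <- sum(d) / (N - 1)
  v    <- N / (N - 1)^2 * (mean(d^2) - mean(d)^2)
  c(cov = s_xy, se = sqrt(v))
}

# Simulated Gaussian pairs with true covariance 0.5 (illustrative)
set.seed(4)
x   <- rnorm(500)
y   <- 0.5 * x + rnorm(500)
res <- cov_with_error(x, y)
res
```

For these simulated data the true variance of the product of the centred variables is 1.5, so the standard error should be close to $\sqrt{1.5/500} \simeq 0.055$.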
Let us now consider the sample linear correlation coefficient, which takes the
compact form:

$$ r = \frac{s_{xy}}{s_x s_y} = \frac{\sum_i (x_i - m_x)(y_i - m_y)}{\sqrt{\sum_i (x_i - m_x)^2\,\sum_i (y_i - m_y)^2}} = \frac{\sum_i x_i y_i / N - m_x m_y}{s_x s_y}\,, \qquad (6.117) $$

where Eq. (4.25) has been used for the last equality. This value is an occurrence of
the random variable “sample correlation coefficient $R$”.

The p.d.f. of $R$ when the true correlation coefficient is $\rho = 0$ was derived in
1915 by R.A. Fisher. In the case of $N$ pairs of Gaussian variables, this distribution
(deduced also in [Cra51]) follows a p.d.f. given by:

$$ c(r) = \frac{\Gamma\!\left(\frac{\nu + 1}{2}\right)}{\Gamma\!\left(\frac{1}{2}\right) \Gamma\!\left(\frac{\nu}{2}\right)}\,(1 - r^2)^{(\nu - 2)/2}\,, \qquad (6.118) $$

where $\nu = N - 2$. In the general case where $\rho \ne 0$, the estimation of the true
correlation coefficient starting from the sample data is a difficult problem, the
brilliant solution of which is again due to R.A. Fisher, who proved this theorem
in 1921:


Theorem 6.4 (Fisher Z Variable) If $R$ is a sample correlation coefficient obtained
from $N$ pairs of Gaussian variables, the variable

$$ Z = \frac{1}{2} \ln \frac{1 + R}{1 - R} \qquad (6.119) $$

follows, for $N \to \infty$, a normal distribution with mean and variance given by:

$$ \langle Z \rangle = \frac{1}{2} \ln \left( \frac{1 + \rho}{1 - \rho} \right)\,, \qquad \mathrm{Var}[Z] = \frac{1}{N - 3}\,, \qquad (6.120) $$

where $\rho$ is the true correlation coefficient.


Proof The complete proof of the theorem is rather complicated and most texts refer
to Fisher’s original article or to his famous book [Fis41].

However, we can give a partial proof of the theorem by considering the case
$\rho = 0$ and applying the inverse transformation of Eq. (6.119) to the density (6.118):

$$ R = \frac{e^{2Z} - 1}{e^{2Z} + 1} = \tanh Z\,, $$

where tanh is the hyperbolic tangent. Since:

$$ 1 - \tanh^2 x = \frac{1}{\cosh^2 x}\,, \qquad \frac{d \tanh x}{dx} = \frac{1}{\cosh^2 x}\,, $$

indicating with $A$ all the constant coefficients of Eq. (6.118) and taking into account
that $\nu = N - 2$, from the equality $c(r)\,dr = f(z)\,dz$, one obtains:

$$ f(z) = A\,(1 - \tanh^2 z)^{(N-4)/2}\,\frac{1}{\cosh^2 z} = A\,\frac{1}{\cosh^{(N-2)} z}\,. $$

We can expand the hyperbolic cosine as:

$$ \cosh z = \frac{1}{2}\left(e^{z} + e^{-z}\right) = 1 + \frac{z^2}{2!} + \frac{z^4}{4!} + \ldots \simeq e^{z^2/2}\,, $$

where the last approximation holds for $z$ of order unity or smaller. Due to the presence of the
logarithm, the values of $z$ remain quite limited, except in the extreme case $r \simeq \pm 1$.
We can then, with a good approximation, stop at this second order expansion and write:

$$ f(z) \simeq A\,e^{-(N-2) z^2/2}\,. $$

Thus it turns out that the variable $Z$ is approximately normal, with zero mean
and variance $\mathrm{Var}[Z] = 1/(N - 2)$. However, the correct result is represented by
Eq. (6.120). □
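
The statement of Theorem 6.4 can also be explored numerically. The following sketch assumes Gaussian pairs with true correlation 0.5, samples of 20 pairs and 5000 repetitions, all of which are illustrative choices; in R the transformation (6.119) coincides with atanh().

```r
set.seed(5)
N    <- 20      # pairs per sample (illustrative)
rho  <- 0.5     # true correlation coefficient (illustrative)
nrep <- 5000

r <- replicate(nrep, {
  x <- rnorm(N)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(N)   # Gaussian pair with correlation rho
  cor(x, y)
})

z <- atanh(r)   # identical to 0.5 * log((1 + r) / (1 - r)), Eq. (6.119)

c(mean_z = mean(z), expected = atanh(rho))        # first of Eq. (6.120)
c(sd_z   = sd(z),   expected = 1 / sqrt(N - 3))   # second of Eq. (6.120)
```

For a sample this small the mean of $Z$ is slightly larger than the asymptotic value, while the standard deviation is already close to $1/\sqrt{N-3}$.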


Fig. 6.13 Histogram of 50,000 correlation coefficients between 20 pairs of uniform variables
$(X, X + V)$ (a); histogram of the corresponding Fisher $Z$ variable and of the Gaussian (full curve)
fitting the data (b)


In practice, the theorem is very powerful, because it holds for $N > 10$ and
gives good results even with non-Gaussian variables. To illustrate the remarkable
properties of Fisher’s transformation, Fig. 6.13 shows the histogram
of 50,000 sample correlation coefficients $r$, each obtained with 20 pairs of uniform
variables $(X, X + V)$ of Exercise 4.2, and the corresponding histogram of the
variable $Z$.

For the estimation of the correlation from a dataset $(\{R = r\}, \{Z = z\})$, one
performs the transformation (6.119), applies the theory of estimation for Gaussian
variables by determining the extremes of the confidence interval and then transforms
these extremes back by exponentiating both sides of Eq. (6.119):

$$ \frac{1 + r}{1 - r} = e^{2z} \;\Longrightarrow\; r = \frac{e^{2z} - 1}{e^{2z} + 1}\,. \qquad (6.121) $$

A simpler formula for the estimation interval can be obtained by applying to
Eq. (6.121) the error propagation law and Eq. (6.120):

$$ s_r = \left| \frac{d}{dz} \left( \frac{e^{2z} - 1}{e^{2z} + 1} \right) \right| s_z = \frac{4 e^{2z}}{(e^{2z} + 1)^2}\,\frac{1}{\sqrt{N - 3}}\,. $$


Table 6.4 Height and chest measurement (in cm) of 1665 Italian soldiers of the First World
War (the data are reported in: M. Boldrini, Statistica (Teoria e Metodi), Editor A. Giuffré, Milan
1962 (only in Italian))


                       Chest circumferences
Heights     72    76    80    84    88    92    96   100   Totals
  150              1     7     2     2     1                   13
  154              7    27    39    28     6     2            109
  158        3     7    69   118    87    21     5            310
  162              9   110   190   126    46     9     2      492
  166              4    68   145   114    58    12     5      406
  170              1    22    46    69    46    12     2      198
  174              1     4    15    35    22     6             83
  178                    5     8    11    16                   40
  182                          2    10     2                   14
Totals       3    30   312   565   482   218    46     9     1665


Substituting the value of $z$ from Eq. (6.119), one then has:

$$ s_r = \frac{1 - r^2}{\sqrt{N - 3}}\,, \quad \text{giving} \quad \rho \in r \pm \frac{1 - r^2}{\sqrt{N - 3}}\,, \quad CL \simeq 68\%\,. \qquad (6.122) $$

Our routine CorrelEst(x, y, conf, alt) evaluates both the covariance from
Eq. (6.114) and the correlation coefficient from Eq. (6.117) between two raw data
vectors x and y. The variables conf and alt define CL and the type of estimation,
“two”, “low” or “upp”, according to the scheme of Fig. 6.3. The error
of the correlation coefficient is estimated by Fisher’s method with Eqs. (6.120-6.121).
The standard deviation error is found with Eq. (6.65) and the covariance
error with Eq. (6.116) and the bootstrap method, a technique that we will describe
later. If the data are represented as a two-dimensional histogram, the routine
CorrelEstH(x, y, mat, conf, alt) can be used, where x and y are the bin
coordinates and mat the two-dimensional matrix containing the number of events
in the cell (x, y).
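
The interval estimate based on Eqs. (6.119) to (6.122) fits in a few lines of R. The two functions below, fisher_ci and approx_ci, are hypothetical helpers written for this sketch and are not the CorrelEst routine; they only implement the two-sided case.

```r
# Two-sided confidence interval for rho through the Fisher Z variable
fisher_ci <- function(r, N, conf = 0.95) {
  z   <- atanh(r)                        # Eq. (6.119)
  t   <- qnorm(1 - (1 - conf) / 2)       # Gaussian quantile for the chosen CL
  lim <- z + c(-1, 1) * t / sqrt(N - 3)  # interval on z, Eq. (6.120)
  tanh(lim)                              # back to r, second of Eq. (6.121)
}

# Approximate interval from error propagation, Eq. (6.122), CL of about 68%
approx_ci <- function(r, N) {
  r + c(-1, 1) * (1 - r^2) / sqrt(N - 3)
}

ci <- fisher_ci(0.266, 1665, conf = 0.95)
ci

# For a small sample the two approaches can be compared at CL of about 68%
fisher_ci(0.15, 40, conf = 0.683)
approx_ci(0.15, 40)
```

The Fisher interval is asymmetric around $r$, while the approximate one is symmetric by construction; the difference grows as $|r|$ approaches 1 or $N$ decreases.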


Exercise 6.10 
Determine the correlation coefficient between height and chest size from the 
data of Table 6.4. 


Answer The table, graphically reproduced in Fig. 6.14, represents a
two-dimensional histogram. The class sizes can be easily deduced from its
structure: for example, there are 110 soldiers with chest circumference
between 78 and 82 cm (central value 80) and height between 160 and 164 cm
(central value 162), and so on.

If $t_i$ and $s_i$ are the spectral values of the chest and height, the marginal
histograms $n(t_i)$ and $n(s_i)$ have a Gaussian form, as can be easily deduced from
the graphs (check this as an exercise). The means and standard deviations of
the marginal histograms are an estimate of the corresponding true quantities
of the marginal densities of the chest circumference and height. They can be
calculated using Eqs. (2.41, 2.42). With obvious notation, one obtains:


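$$ m_t = \frac{1}{1665}\,(72 \cdot 3 + 76 \cdot 30 + \ldots) = 85.71\,, $$

$$ s_t = \left[ \frac{(72 - 85.71)^2 \cdot 3 + (76 - 85.71)^2 \cdot 30 + \ldots}{1664} \right]^{1/2} = 4.46\,, $$

$$ m_s = \frac{1}{1665}\,(150 \cdot 13 + 154 \cdot 109 + \ldots) = 163.71\,, $$

$$ s_s = \left[ \frac{(150 - 163.71)^2 \cdot 13 + (154 - 163.71)^2 \cdot 109 + \ldots}{1664} \right]^{1/2} = 5.79\,. $$

The two standard deviations $s_t$ and $s_s$ have been calculated dividing by $(1665 - 1)$,
since the sample means have been used.
The $1\sigma$ estimate of the chest and height means is given by Eq. (6.50):

$$ \mu_t \in 85.71 \pm \frac{4.46}{\sqrt{1665}} = 85.71 \pm 0.11\,, $$

$$ \mu_s \in 163.71 \pm \frac{5.79}{\sqrt{1665}} = 163.71 \pm 0.14\,. $$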


We now come to the study of correlation. The data should be correlated,
because experience shows that short men with huge chests, or vice versa, are
very rare. We therefore calculate the sample covariance (6.115):

$$ s_{st} = \frac{1}{1664}\,[(72 - 85.71)(158 - 163.71) \cdot 3 + (76 - 85.71)(150 - 163.71) \cdot 1 + (76 - 85.71)(154 - 163.71) \cdot 7 + \ldots] = 6.876\,. $$

The sample correlation coefficient (6.117) is then given by:

$$ r_{st} = \frac{s_{st}}{s_s\,s_t} = \frac{6.876}{5.79 \cdot 4.46} = 0.266\,. $$

We now estimate the height-chest correlation coefficient of the soldiers
with CL = 95%. First, we calculate the $Z$ variable corresponding to the
sample correlation $r = 0.266$:

$$ z = \frac{1}{2} \ln \frac{1 + 0.266}{1 - 0.266} = 0.272\,. $$


This is an approximate Gaussian variable, with standard deviation given by
the second of Eq. (6.120). Since the data refer to 1665 soldiers, one has:

$$ \sigma_z = \sqrt{\frac{1}{N - 3}} = \sqrt{\frac{1}{1662}} = 0.024\,. $$

The 95% confidence limits for a Gaussian variable are given by the standard
variable $t = 1.96$:

$$ \mu_z \in 0.272 \pm 1.96 \cdot 0.024 = 0.272 \pm 0.047 = [0.225, 0.319]\,. $$

This interval then contains the value $\mu_z$ with CL = 95%. The confidence
interval for the true correlation coefficient $\rho$ is finally evaluated by inserting
the values (0.225, 0.319) in the second of Eq. (6.121):

$$ \rho \in [0.221, 0.309] = 0.266^{+0.043}_{-0.045}\,, \qquad CL = 95\%\,. $$

Therefore, the height-chest data clearly demonstrate the presence of a
correlation. This same interval can also be calculated with the approximate
formula (6.122).

Moreover, all the previous results can be obtained as well by inserting the
data of Table 6.4 into our routine CorrelEstH().

One may ask the question: what chest circumference must a 170 cm tall
soldier have to be considered normal? If we take the histogram of Table 6.4
as a reference sample, we can answer the question by estimating, with
Eqs. (4.54, 4.55), the mean (on the regression line) and the standard deviation
of the chest $t$ conditional on the height $s$:

$$ m(t|s) = m_t + r_{st}\,\frac{s_t}{s_s}\,(s - m_s) = 85.71 + 0.204\,(170 - 163.71) \simeq 86.9 \text{ cm}\,, $$

$$ s(t|s) = \left[ s_t^2\,(1 - r_{st}^2) \right]^{1/2} = [19.89\,(1 - 0.071)]^{1/2} \simeq 4.3 \text{ cm}\,. $$

Again, under the assumption of a Gaussian model, we can state that the chest
circumference of a normal soldier must be within the limits 86.9 ± 4.3 cm
with a probability given by the $3\sigma$ law.
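
The numbers of Exercise 6.10 can be recomputed with base R alone. The sketch below does not use CorrelEstH; the matrix is our reading of Table 6.4 (rows are heights, columns are chest circumferences, empty cells set to zero), and the cell placement has been checked only against the printed row and column totals.

```r
chest  <- seq(72, 100, by = 4)    # bin centres of the chest circumference (cm)
height <- seq(150, 182, by = 4)   # bin centres of the height (cm)

# Counts as read from Table 6.4 (assumed cell placement, see the text above)
n <- matrix(c(
  0, 1,   7,   2,   2,  1,  0, 0,
  0, 7,  27,  39,  28,  6,  2, 0,
  3, 7,  69, 118,  87, 21,  5, 0,
  0, 9, 110, 190, 126, 46,  9, 2,
  0, 4,  68, 145, 114, 58, 12, 5,
  0, 1,  22,  46,  69, 46, 12, 2,
  0, 1,   4,  15,  35, 22,  6, 0,
  0, 0,   5,   8,  11, 16,  0, 0,
  0, 0,   0,   2,  10,  2,  0, 0), nrow = 9, byrow = TRUE)

N   <- sum(n)
m_t <- sum(chest  * colSums(n)) / N                      # mean chest
m_s <- sum(height * rowSums(n)) / N                      # mean height
s_t <- sqrt(sum((chest  - m_t)^2 * colSums(n)) / (N - 1))
s_s <- sqrt(sum((height - m_s)^2 * rowSums(n)) / (N - 1))

s_st <- sum(outer(height - m_s, chest - m_t) * n) / (N - 1)   # Eq. (6.115)
r    <- s_st / (s_s * s_t)                                    # Eq. (6.117)

# 95% confidence interval through the Fisher Z variable
ci <- tanh(atanh(r) + c(-1, 1) * 1.96 / sqrt(N - 3))

round(c(m_t = m_t, s_t = s_t, m_s = m_s, s_s = s_s, s_st = s_st, r = r), 3)
round(ci, 3)
```

Small differences in the last digit with respect to the values quoted in the answer come from the rounding of the intermediate results.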


Fig. 6.14 Two-dimensional histogram of the data of Table 6.4 




6.16 Problems 


6.1 An urn contains 400 red and black marbles in unknown proportions; 30 marbles
are drawn at random, including 5 red marbles. Estimate, with CL = 95%, the
initial number of red marbles R contained in the urn in the case of (a) extraction
with replacement using the approximation for large samples, (b) extraction with
replacement using the code binom.test and (c) extraction without replacement
using the approximation for large samples.


6.2 In an experiment, 20 counts have been obtained. Find the upper bound of the 
expected value of the counts, with CL = 90%. Use the Gaussian approximation of 
the Poisson distribution. 


6.3 The standard deviation of the weight of a population of adults is $\sigma = 12$ kg.
Find the mean $\langle S \rangle$ and the standard deviation $\sigma[S]$ for a sample of $N = 200$
individuals.


6.4 By running an unknown number $N$ of tests, each one with a known a priori
success probability $p = 0.2$, $n = 215$ successes have been obtained. Estimate $N$
with CL = 90%.


6.5 If $X_i$ $(i = 1, 2, \ldots, N)$ is a random sample from a normal population, show
that the covariance between the mean $M$ and the deviation $(X_i - M)$ is zero.


6.6 25 Gaussian variables with the same mean $\mu$ (unknown) and $\sigma = 1$ are summed,
obtaining 245 as result. Find (a) the standard estimation interval for $\mu$ (CL = 68%)
and (b) the upper limit of $\mu$ with CL = 95%.


6.7 A sample of 40 Gaussian pairs has a sample correlation coefficient $r = 0.15$.
Find the interval estimate of $\rho$ with $CL \simeq 68\%$.


6.8 A Poissonian count gives $x = 167$ events in 10 s. It is known that the average
of the $X$ counts is a function of both the counting efficiency $\epsilon$ on the signal and of the
background $b$ according to the equation $\mu = \epsilon \nu + b$, that is:

$$ X \sim \mathrm{Poisson}(\epsilon \nu + b)\,. $$

The efficiency $\epsilon$ was determined as $\epsilon \pm \sigma_\epsilon = 0.90 \pm 0.10$, while the background is
estimated based on a value of $b = 530$ counts in 100 s. Estimate the frequency value
$\nu$ of the source with error.


6.9 Calculate the number $N$ of bulbs you need to be 95% sure to have 1000 h of
light, knowing that each bulb has a negative exponential lifetime with mean
$t = 100$ h.
Assume that a single bulb is used until it burns out and that the next one is then
switched on immediately, until the limit of 1000 h is reached. Also assume the Gaussian
distribution for the sum of the bulb lifetimes.


6.10 The expected background of a counting experiment (accurately measured 
during calibration) is 10 counts/s. In an experiment (background plus possible 
signal) 25 counts are recorded in a second. Using the Gaussian approximation, find 
the upper limit of the counts for the signal only with CL = 95%. 


6.11 A Gaussian sample of $N = 25$ elements has a measured variance $s^2 = 18$.
Find the upper limit with CL = 95%.


6.12 An electric cable has 30 defects every 20 km. Find the number of defects/km 
with its error. 


6.13 Having recorded 35 counts, find the lower event limit with CL = 95% using 
the routines poisson.test and PoissApp. 


6.14 Create three random Gaussian samples with 50 elements using the R
instructions x=rnorm(50), y=rnorm(50,5,1) and y1=3*x+y. Find the
covariances and the correlation coefficients between the variables x, y and x, y1,
both analytically and using the routine CorrelEst.


6.15 Create three random samples with 100 elements from the uniform distribution
with the R commands x=runif(100), y=runif(100) and y1=2*x+y. Find
the covariances and the correlation coefficients between the variables x, y1, both
analytically and with the routine CorrelEst.


6.16 In a famous experiment on the efficacy of aspirin [ET93], 104 cases of
heart attacks occurred in a sample of 11,037 people who had been taking this
drug for several years. In a control sample of 11,034 people who took a placebo,
189 heart attacks occurred. In this kind of study, the odds ratio OR is often
considered, i.e. the ratio between the frequencies $f_1 = 104/11037 = 0.00942$
and $f_2 = 189/11034 = 0.01713$. In this case $OR = f_1/f_2 = 0.55$, indicating
that aspirin halves the probability of heart attacks. Find the confidence interval at
CL = 90% of this data. (Hint: linearize the problem by applying logarithms.)


Chapter 7
Basic Statistics: Hypothesis Testing


There are two possible outcomes: if the result confirms the 
hypothesis, then you’ve made a measurement. If the result is 
contrary to the hypothesis, then you’ve made a discovery. 


Enrico Fermi, QUOTED IN: T. JEREMOVICH, “NUCLEAR 
PRINCIPLES IN ENGINEERING” 


No amount of experimentation can ever prove me right; a single 
experiment can prove me wrong. 


Albert Einstein, QUOTED IN: A. CALAPRICE, “THE ULTIMATE 
QUOTABLE EINSTEIN” 


7.1 Testing One Hypothesis 


In Exercises 3.13 and 3.17, we have already discussed hypothesis testing in the
framework of probability theory. Let us now go deeper into this topic considering
the testing (acceptance or rejection) of one hypothesis, called null hypothesis $H_0$
or, more simply, hypothesis. The subject will then be completed later, in Chap. 10,
where we will describe the criteria for optimizing the choices among several
alternative hypotheses.

We therefore consider the p.d.f. $p(x; \theta)$ of the variable $X$ defined in Eq. (6.2),
depending on an unknown parameter $\theta$, and suppose that, on the basis of a
priori arguments or previous experimental results, the hypothesis $H_0$ is assumed,
corresponding to a value $\theta = \theta_0$. If an experimental value $\{X = x\}$ is obtained,
one needs to decide whether or not to accept the model related to this hypothesis,
on the basis of this result. The scheme used is typically that of Fig. 7.1: as you can
see, the variability interval of $X$ is divided into two regions, a region favourable
to the hypothesis and a complementary region, called critical region. The shape of
these regions strongly depends on the type of problem considered, as detailed in the
following. When the critical region is concentrated in only one of the two tails of
the distribution, its limit is defined by the quantile values $x_\alpha$ (left tail in Fig. 7.1a) or
$x_{1-\alpha}$ (right tail in Fig. 7.1b).


© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
A. Rotondi et al., Probability, Statistics and Simulation, La Matematica per il 3+2
139, https://doi.org/10.1007/978-3-031-09429-3_7




[Figure: three panels of the density $p(x; \theta_0)$: (a) one-tailed test with the rejection
region in the left tail, (b) one-tailed test with the rejection region in the right tail,
(c) two-tailed test with a rejection region in each tail; in all panels $\alpha = 1 - CL$]

Fig. 7.1 One-tailed (a), (b) or two-tailed test (c). The confidence level CL is the value of the
shaded area defined by the quantile values $x_\alpha$, $x_{1-\alpha}$, $x_{\alpha/2}$ and $x_{1-\alpha/2}$. The test level $\alpha = 1 - CL$
is the value of the tail area outside the confidence interval


If, on the other hand, the critical region is formed by two disjoint subsets, as, for
example, when one intends to reject a significant deviation from the mean value, its
limits are defined by the quantile values $x_{\alpha/2}$ and $x_{1-\alpha/2}$ (see Fig. 7.1c). As we have
already mentioned, the area $\alpha$ of Fig. 7.1 is called significance level, and we will
denote it by SL. The pre-chosen value of the significance level $\alpha$ defines the $\alpha$ level
of the test.

When the event {X = x} is observed, under the hypothesis $H_0$ corresponding to
$p(x; \theta_0)$, it is then necessary to calculate the conditional probabilities:

$$P\{X < x \mid H_0\} = \alpha_x , \quad \text{one-tailed test (to the left)} \tag{7.1}$$
$$P\{X > x \mid H_0\} = \alpha_x , \quad \text{one-tailed test (to the right)} \tag{7.2}$$
$$2 \min(P\{X < x\}, P\{X > x\}) = \alpha_x , \quad \text{two-tailed test} \tag{7.3}$$

In the two-tailed test, the area $\alpha_x/2$ is the smallest area, to the left or to the right of
the abscissa value, according to Fig. 7.1.
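As a minimal sketch of Eqs. (7.1-7.3), assume that under $H_0$ the variable X is Gaussian with mean 10 and standard deviation 2, and that the value x = 13.5 has been observed. These numbers are purely illustrative and do not come from the text; only the base R routine pnorm is used.

```r
# Illustrative values (assumptions, not taken from the text)
mu0   <- 10     # mean of X under H0
sigma <- 2      # standard deviation of X under H0
x     <- 13.5   # observed value

p_left  <- pnorm(x, mean = mu0, sd = sigma)   # Eq. (7.1): P{X < x | H0}
p_right <- 1 - p_left                         # Eq. (7.2): P{X > x | H0}
p_two   <- 2 * min(p_left, p_right)           # Eq. (7.3): twice the smaller tail

print(c(left = p_left, right = p_right, two.tailed = p_two))
```

With these inputs the observed value lies 1.75 standard deviations above the mean, so the right-tail p-value is the small one and the two-tailed p-value is its double.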

The hypothesis is rejected when $\alpha_x < \alpha$, that is when we obtain by chance values
“within the tails”. With this rule, the probability that the decision is wrong when $H_0$
is true never exceeds $\alpha$. This is called a type I error (to reject a true $H_0$), or
also a false-positive case. The other possible error, to accept a false $H_0$, is known as
type II error or false-negative case. This second case will be described in detail in
Chap. 10. The value $\alpha$ is called size of the type I error.

It should be noted immediately that all the previous definitions lose their meaning
if we fail to specify which hypothesis is assumed to be true, because, to calculate the
significance level, it is necessary to know the p.d.f. corresponding to $H_0$. Within the
frequentist framework, this requirement is not a trivial aspect but is the very essence
of the definition! The probabilities given by Eqs. (7.1-7.3) must not be confused with
the probability that $H_0$ is true for a given SL, which corresponds to $P(H_0 \mid X > x)$.
Previously, in Chap. 1, we have already emphasized the dangers of inverting
conditional probabilities without going through Bayes’ theorem.

The logical approach applied to reject a hypothesis assumed to be true has found 
an important field of applicability in modern science. In fact, according to the 
philosopher Karl Popper [Pop59], a scientific theory can be differentiated from the 
non-scientific ones by the fact that it is falsifiable, that is, it can be put to the test,
and possibly refuted, by an experiment. In many cases, this procedure passes through
the statistical analysis of
experimental data and the frequentist hypothesis test (see also Sect. 12.17). The 
quotes of Fermi and Einstein, in the epigraph of the chapter, well represent this 
distinctive scientific thinking. 

Finally, it is important to emphasize that $H_0$ cannot be chosen subjectively but
must be determined only on the basis of the currently accepted scientific knowledge 
(the so-called state of the art). Therefore, before announcing a new discovery, it 
is necessary to demonstrate that it is not possible to explain the result obtained 
only starting from past experience. We have already implicitly applied this rule in 
Exercises 3.13-3.17. 

The observed significance level $\alpha_x$, defined in Eqs. (7.1-7.3), is also called
p-value. The meaning assigned to the terms significance level, test level and
p-value can be slightly different from text to text. Here, we will adopt the following
terminology:

significance level:
    fixed before the experiment → test level
    observed value from an experiment → p-value

Basically, with this terminology, a hypothesis test can be summarized as follows:
the hypothesis is discarded if the p-value is less than the level of the test $\alpha$; it is
accepted otherwise.

By performing repeated experiments and calculating the p-value each time, we 
obtain a sample of the random variable p-value. In hypothesis testing it is crucial 
to know how this variable is distributed: the answer is given by 


Theorem 7.1 (On the p-value p.d.f.) The p-value follows the uniform density, that
is, $P \sim U(0, 1)$.


Proof The proof follows immediately from Theorem 3.5, which states that
cumulative variables C(X) are uniform variables. In the one-tailed test, one has
$P \sim U(0, 1)$, since the p-value is $P = C(X) \sim U(0, 1)$ or $P = 1 - C(X) \sim U(0, 1)$.
In the two-tailed test, the smaller tail area, to the right or to the left of the measured
value, is always distributed as U(0, 0.5). Since, in Eq. (7.3), we have defined the
p-value to be the double of this value, we get again $P \sim U(0, 1)$. □
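Theorem 7.1 can be checked numerically. The sketch below assumes, for illustration only, that each experiment yields a single standard Gaussian observation under $H_0$; the seed and the number of repetitions are arbitrary choices.

```r
set.seed(1)                 # arbitrary seed, for reproducibility only
nrep <- 10000               # number of simulated experiments
x    <- rnorm(nrep)         # one observation per experiment, H0: X ~ N(0,1)

p_one <- 1 - pnorm(x)                        # right-tailed p-values, Eq. (7.2)
p_two <- 2 * pmin(pnorm(x), 1 - pnorm(x))    # two-tailed p-values, Eq. (7.3)

# For U(0,1) the mean is 1/2 and the variance is 1/12
print(c(mean = mean(p_one), var = var(p_one)))
print(c(mean = mean(p_two), var = var(p_two)))

# Fraction of experiments with p-value below alpha = 0.05
print(mean(p_two < 0.05))
```

Since the p-values are uniform, the fraction of experiments falling below a given test level should be close to that level, which is the type I error rate discussed above.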


A hypothesis test can be performed also with a statistic $T = t(X_1, X_2, \ldots, X_n)$
that estimates the value of the unknown parameter $\theta$, that is, with estimators. In
this case, the density of the estimator $p(t; \theta_0)$ must substitute the sample density
$p(x; \theta)$ and the test procedure remains unchanged.

There are no fixed rules to establish the test level $\alpha$. Usually small values, about
1-5%, are chosen, given that, as previously discussed, the probability of making a
type I error must, in any case, be kept within an “acceptably low” level. However,
when $H_0$ is based on a theory or physical law that is very well experimentally
verified (as, e.g. is the case of Newton’s law of universal gravitation), the scientific
community requires a very strong contrary evidence to refute it, with a level $\alpha$ which
can also reach ~ 10~°. This conservative position essentially serves to avoid a false 
discovery even in the presence of possible undetected mistakes both in measurement 
and data analysis procedures (see, e.g. [Lyo13, LW18]). 

If the experimental p-value is less than $\alpha$, the observed value is located in the tails
of the reference density. In this situation, it is considered less risky to reject the
hypothesis $\theta = \theta_0$ than to accept it. The opposite is interpreted as a statistical
fluctuation, the deviation is attributed to chance, and the hypothesis is accepted.

Finally, we note that in some cases, the comparison between test level and p- 
value is ambiguous: 


• For a discrete random variable, the p-values of two adjacent values can straddle
the test level. For example, in the one-tailed test on the right, it can
happen that:

$$P\{X > x_1\} = p_1 < \alpha < p_2 = P\{X > x_2\} ,$$

with $x_1 > x_2$. In this case the hypothesis is accepted if $\{X = x_2\}$ and rejected if
$\{X = x_1\}$. Alternatively, if a uniform generator of variables 0 < U < 1 is available
(e.g. a computer routine random), the test can be randomized with a probability
p such that:

$$p_1 + p\,(p_2 - p_1) = \alpha \implies p = \frac{\alpha - p_1}{p_2 - p_1} \tag{7.4}$$

Provided that the value $\{X = x_2\}$ is obtained, the hypothesis is discarded if the
routine random gives a value {U = u < p}; otherwise it is accepted. In this way,
by construction, the experimental p-value exactly coincides with the $\alpha$ level of
the test.

• In the case of the two-tailed test, different critical values can correspond to a
given level $\alpha$, that is, pairs of tails of different sizes that leave the same central
area CL = 1 − $\alpha$ are possible. Usually in this case the two left and right
extremes are chosen to subtend the same area $\alpha/2$. This convention has been
implicitly adopted in Eq. (7.3).
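The randomization of Eq. (7.4) can be sketched as follows. The binomial model (10 trials with success probability 0.5 under $H_0$), the level 0.05 and the two adjacent values 9 and 8 are assumptions chosen for illustration; the right-tail p-values are taken here to include the observed value, as is customary for discrete variables.

```r
n     <- 10      # number of trials (assumed)
prob0 <- 0.5     # success probability under H0 (assumed)
alpha <- 0.05    # test level
x1 <- 9          # adjacent values whose p-values straddle alpha
x2 <- 8

# right-tail p-values, including the observed value
p1 <- 1 - pbinom(x1 - 1, n, prob0)   # P{X >= 9} = 11/1024
p2 <- 1 - pbinom(x2 - 1, n, prob0)   # P{X >= 8} = 56/1024

p <- (alpha - p1) / (p2 - p1)        # randomization probability, Eq. (7.4)

# one randomized decision when {X = x2} has been observed
u      <- runif(1)
reject <- (u < p)

# overall rejection probability under H0:
# always reject for X >= x1, reject with probability p for X = x2
size <- p1 + p * dbinom(x2, n, prob0)
print(c(p1 = p1, p2 = p2, p = p, size = size))
```

The last quantity shows the purpose of the construction: the probability of rejecting a true $H_0$ equals the chosen level instead of jumping from p1 to p2.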


In the following we will describe the three most important test categories: the group
of Student’s tests (t-test and z-test), the $\chi^2$-test and the test for the significance of
variation sources, called ANOVA (ANalysis Of VAriance).


7.2 The Gaussian z-Test 


This test is based on the standard Gaussian quantile, which is usually denoted as z:

$$z = \frac{x - \mu}{\sigma} \tag{7.5}$$


If the standard deviation $\sigma$ is known, then the problem can be solved with
probability theory, verifying the compatibility of each single value with the values
assumed as true. Examples of this technique were given in Exercises 3.13-3.17, and
the same method can be applied in the case of sample means which, according to the
Central Limit Theorem 3.1, can be considered Gaussian for N > 20-30. However,
when some density parameters are unknown, it is often necessary to perform the
test using statistics. For instance, $\sigma$ is usually unknown and is then replaced with
the experimental value s. We know, from Sect. 6.11, that for Gaussian samples
the quantile follows the Student’s density that can be basically considered to be
Gaussian when the number N of events is N > 100. In these cases the z-test for
Gaussian samples can be used. This test can also be extended to two values $x_1$, $x_2$
when one needs to evaluate the difference $\mu_D = \mu_1 - \mu_2$ between two means.

Since the variance of the difference of two independent variables is given by
Eqs. (5.66, 5.4), one can write:

$$\mathrm{Var}[D] = \mathrm{Var}[X_1] + \mathrm{Var}[X_2] \simeq s_1^2 + s_2^2 ,$$

where, in the last step, the approximation for large samples was used (see Table 6.3)
by replacing the unknown true variances with the measured ones. We can then define
the standard value:

$$t_D = \frac{x_1 - x_2 - \mu_D}{\sqrt{s_1^2 + s_2^2}} \tag{7.6}$$

and calculate the significance level according to the Gaussian density. The case of
small Gaussian samples where $t_D$ is a quantile of the Student’s density will be
discussed in the next section. We also recall that (as shown in Exercise 5.3) the
difference of two independent Gaussian variables is also Gaussian. The quantiles
$t_D$ of Eq. (7.6) have to be identified with $x_\alpha$, $x_{1-\alpha}$, $x_{\alpha/2}$, $x_{1-\alpha/2}$ of Fig. 7.1, and the
evaluation of the corresponding p-value must take into account whether the test is
one-tailed or two-tailed.

In R, the cumulative value corresponding to a (negative or positive) Gaussian
quantile t is calculated by the pnorm(t) routine, which gives the area of the tail on
the left of t. Given the symmetry of the distribution, the corresponding p-values are
given, in the case of the two-tailed test, by p = 2*(1-pnorm(abs(t))), while
in the case of the one-tailed test by p = 1-pnorm(abs(t)). Usually Eq. (7.6) is
used, for large samples, to compare two frequencies or two sample means. In the
first case, we can write the two frequencies as $f_1 = x_1/N_1$ and $f_2 = x_2/N_2$, if $N_1$
and $N_2$ are the number of trials of the two experiments. From Eq. (6.33), valid for
large samples ($N_1, N_2 > 100$, $x_1, x_2 > 10$), one immediately obtains:


$$t_f = \frac{(f_1 - f_2) - p}{\sqrt{\dfrac{f_1(1 - f_1)}{N_1} + \dfrac{f_2(1 - f_2)}{N_2}}} , \tag{7.7}$$

with Gaussian probability levels, when $p = p_1 - p_2$ is the probability difference
under $H_0$.

When p = 0, that is when under $H_0$ the two samples come from the same
binomial population, the frequency obtained by adding the two homogeneous
samples is often used for the error calculation:

$$\hat{f} = \frac{x_1 + x_2}{N_1 + N_2} \tag{7.8}$$

By inserting this estimate into Eq. (7.7) with p = 0, we obtain the pooled standard
variable:

$$t_p = (f_1 - f_2) \Big/ \left[ \hat{f}\,(1 - \hat{f}) \left( \frac{1}{N_1} + \frac{1}{N_2} \right) \right]^{1/2} \tag{7.9}$$

If $N_1 = N_2 = N$, the comparison can also be done in terms of number of hits $x_1$ and
$x_2$ rather than of probabilities, always using Eq. (7.6) with $\mu_D = 0$ and the variance
defined in Eq. (6.34):

$$t_D = \frac{x_1 - x_2}{\sqrt{x_1 \left(1 - \dfrac{x_1}{N}\right) + x_2 \left(1 - \dfrac{x_2}{N}\right)}} \tag{7.10}$$


Our routine GdiffProp(x,n,p,pool,alt) performs the test between two
frequencies using Eqs. (7.7, 7.9). The variables x and n are two-element
vectors containing the values $x_1$, $x_2$ and $N_1$, $N_2$, respectively. If pool=TRUE
(default) Eq. (7.9) is used, which requires p = 0 (default value), otherwise Eq. (7.7)
is used. If alt="two" or alt="one", the two-tailed or one-tailed test
p-value is calculated, respectively.
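To make Eqs. (7.7-7.9) concrete, here is a minimal sketch of a two-frequency test written with base R only. The function diff_prop_sketch and its argument two_tailed are hypothetical names introduced for this illustration; this is not the GdiffProp routine of the text, whose source is not reproduced here. The input counts are the 62 and 50 hits out of 200 trials that appear later in Exercise 7.2.

```r
# Hypothetical helper (not the GdiffProp routine of the text)
diff_prop_sketch <- function(x, n, p = 0, pool = TRUE, two_tailed = TRUE) {
  f <- x / n                                  # observed frequencies f1, f2
  if (pool) {
    fp <- sum(x) / sum(n)                     # pooled frequency, Eq. (7.8)
    se <- sqrt(fp * (1 - fp) * sum(1 / n))    # denominator of Eq. (7.9), p = 0
    t  <- (f[1] - f[2]) / se
  } else {
    se <- sqrt(sum(f * (1 - f) / n))          # denominator of Eq. (7.7)
    t  <- (f[1] - f[2] - p) / se
  }
  tail <- 1 - pnorm(abs(t))                   # one-tailed Gaussian p-value
  list(t = t, p.value = if (two_tailed) 2 * tail else tail)
}

res2 <- diff_prop_sketch(c(62, 50), c(200, 200))
res1 <- diff_prop_sketch(c(62, 50), c(200, 200), two_tailed = FALSE)
print(c(t = res2$t, two.tailed = res2$p.value, one.tailed = res1$p.value))
```

With these counts the sketch returns p-values of about 0.18 (two-tailed) and 0.09 (one-tailed). The pooled branch ignores p on purpose, because Eq. (7.9) is only defined for p = 0.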

To compare the difference between two true means $\mu = \mu_1 - \mu_2$ with that of
two means $m_1$ and $m_2$, coming from two different samples, with variances $s_1^2$, $s_2^2$
and number of elements $N_1$, $N_2$ respectively, from Eq. (6.53), we get the standard
value:

$$t_m = \frac{m_1 - m_2 - \mu}{\left( \dfrac{s_1^2}{N_1} + \dfrac{s_2^2}{N_2} \right)^{1/2}} \tag{7.11}$$




Often $\mu_1 = \mu_2$, that is $\mu = 0$, and the true means are unknown, but the compatibility
between the two experimental means is tested. As we know, Gaussian probability
levels can be applied when $N_1, N_2 > 100$.

The p-value corresponding to Eq. (7.11) is calculated by our routine
GdiffMean(m,s,mu,alt), where m and s are the two-element vectors
containing $m_1$, $m_2$ and $s_1/\sqrt{N_1}$, $s_2/\sqrt{N_2}$, respectively; mu = $\mu$ (= 0 by default)
and alt are defined as in GdiffProp. When m and s contain a single value,
mu is the true population mean, s is the sample standard deviation, and the test is
performed with Eq. (7.5).


Exercise 7.1 
In the measurement of the same physical quantity, two groups of experi- 
menters obtained the values: 


$$x_1 = 12.3 \pm 0.5 ,$$
$$x_2 = 13.5 \pm 0.8 ,$$

and reported that both measurements are of Gaussian type, since they have
been obtained from large samples. Check if the results are compatible with each
other.


Answer Since we do not know the size of the samples used by the two groups, 
we must assume that the approximation for large samples is valid and apply 
the Gaussian test. 

The call to GdiffMean(c(12.3,13.5),c(0.5,0.8)) gives $t_m = -1.27$
and a two-tailed p-value p = 0.203. Notice the call to the R function
c() used to load the vectors with the experimental results.

Since the probability to be wrong when rejecting the compatibility hypoth- 
esis is too high, the two measurements must be considered compatible, i.e. 
they cannot be questioned only on the basis of statistics. 
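The numbers quoted in this answer can be reproduced with base R alone. The following lines are a sketch of Eq. (7.11) with $\mu = 0$, applied to the two measurements and their errors; they are not the GdiffMean routine itself.

```r
m <- c(12.3, 13.5)    # the two measured values
s <- c(0.5, 0.8)      # their standard errors

t <- (m[1] - m[2]) / sqrt(sum(s^2))   # Eq. (7.11) with mu = 0
p <- 2 * (1 - pnorm(abs(t)))          # two-tailed Gaussian p-value
print(c(t = t, p.value = p))
```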


Exercise 7.2 

In an extrasensory perception experiment (ESP), five boxes, numbered from 
1 to 5, were prepared, and a target object was placed inside box number 3. 
Two hundred people were asked to guess which box contained the target and 
62 of them indicated just the number 3. In a control test with all the empty 
boxes, but letting the audience believe that a target was present, 50 persons 
pointed to box 3. Determine whether the experiment reveals ESP effects or 
not. 



Answer If we use probability theory, we can proceed as in Exercise 3.17 and 
assume as a null hypothesis that each of the five boxes has an equal probability 
of 1/5 of being chosen through a purely random guess. Then the expected 
theoretical distribution is binomial, of mean and standard deviation given by: 


$$\mu = 40 , \qquad \sigma = \sqrt{40\,(1 - 40/200)} = 5.65 .$$

With 62 successes, the standard variable (3.37) is:

$$t = \frac{62 - 40}{5.65} = 3.9 ,$$

corresponding, from Table E.1, to a very low significance level, i.e. $< 1 \cdot 10^{-4}$.
Therefore, the hypothesis that ESP effects are present seems confirmed by this 
test. 

However, experimental psychology states that the number 3 is psychologically
favoured (in general all odd numbers are favoured over the even ones).
In other words, in a blank test with boxes numbered from 1 to 5, most of the
choices should be on box number 3, even in the absence of a target, simply
because number 3 is “nicer” than the others.

It is then necessary, in the absence of an a priori model, to abandon
the probabilistic approach and solve the problem within the statistical
framework, using only the number of hits with target ($x_1 = 62$) and the one
without a target ($x_2 = 50$). The statistical errors to be associated with these
observations are obtained from Eq. (6.34):


$$s(50) = \sqrt{50 \left(1 - \frac{50}{200}\right)} = 6.12 , \qquad s(62) = \sqrt{62 \left(1 - \frac{62}{200}\right)} = 6.54 .$$

Now, combining the results of the blank test and of the target test in Eq. (7.10),
we have:

$$t = \frac{62 - 50}{\sqrt{(6.12)^2 + (6.54)^2}} = 1.34 ,$$

corresponding, from Table E.1, to an observed two-tailed significance level
(p-value) of

$$P\{|T| > 1.34\} = 2 \cdot (0.5 - 0.4099) = 0.1802 \simeq 18\% .$$


The routine GdiffProp(c(62,50),c(200,200)) gives the same
result; for the one-tailed test, one has
GdiffProp(c(62,50),c(200,200),alt="one"), which gives the result p = 0.091.
Therefore, this analysis shows that in about 18% of times, making guesses
and being psychologically biased in favour of number 3, there may be
deviations of more than 12 units (in excess or in defect) between the blank
test and the test with the target. The excess occurs in 9% of cases. The
result is therefore compatible with pure chance and provides no evidence of ESP
effects.


Exercise 7.3 
Evaluate the compatibility between the true and simulated values in Exer- 
cise 4.2. 


Answer This exercise refers to 10,000 simulated data pairs, from which the
following values were obtained:

$$r_1 = 1.308 \cdot 10^{-2} , \quad r_2 = 0.7050 , \quad r_3 = -0.7131 .$$

The true values are:

$$\rho_1 = 0 , \quad \rho_2 = 0.7071 , \quad \rho_3 = -0.7071 .$$

We transform both the experimental and the true values using Eq. (6.119) and
the first of Eqs. (6.120), respectively:

$$z_1 = 1.308 \cdot 10^{-2} , \quad z_2 = 0.8772 , \quad z_3 = -0.8935 ,$$
$$\mu_1 = 0 , \quad \mu_2 = 0.8814 , \quad \mu_3 = -0.8814 .$$

The $z_i$ values refer to a Gaussian variable with mean $\mu_i$ and standard
deviation:

$$\sigma = \sqrt{\frac{1}{N - 3}} = \sqrt{\frac{1}{9997}} = 0.0100 .$$

We then can define the three standard variables:

$$t_1 = \frac{|z_1 - \mu_1|}{\sigma} = 1.31 , \quad t_2 = \frac{|z_2 - \mu_2|}{\sigma} = 0.42 , \quad t_3 = \frac{|z_3 - \mu_3|}{\sigma} = 1.21 ,$$

which all give high significance levels, as you can see from Table E.1.
The data differ by 1.31, 0.42 and 1.21 standard deviations from their mean,
indicating a good agreement between simulation and expected values.
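The calculation can be repeated in a few lines of R. This sketch assumes that the transformation of Eq. (6.119) is the usual Fisher transformation, available in R as atanh, and neglects the small bias term that may enter the expected value of z; with N = 10,000 that term does not change the rounded results.

```r
N   <- 10000
r   <- c(1.308e-2, 0.7050, -0.7131)   # simulated correlation coefficients
rho <- c(0, 0.7071, -0.7071)          # true values

z     <- atanh(r)          # Fisher transformation, 0.5*log((1+r)/(1-r))
mu    <- atanh(rho)        # transformed true values (bias term neglected)
sigma <- 1 / sqrt(N - 3)   # standard deviation of z

t <- abs(z - mu) / sigma   # standard variables
p <- 2 * (1 - pnorm(t))    # two-tailed Gaussian p-values
print(round(rbind(z, mu, t, p), 4))
```

All three two-tailed p-values are well above the usual test levels, in line with the conclusion of the exercise.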


Exercise 7.4 

In a work on the harmfulness of radio frequencies used by cell phones [F+18],
a sample of 1631 rats was irradiated for 2 years with radio frequencies of
intensity comparable with those of base radio antennas. The pooled results,
compared to the control group, were as follows:

                 Sample   Heart tumours   Brain tumours
Exposed rats      1631         13              19
Control group      817          2               4

Evaluate whether the data indicate some harmful effect.


Answer In this case, as previously commented, $H_0$ represents the
non-harmfulness of the radio frequencies, that is, the homogeneity of the irradiated
and control samples is assumed. For each tumour type, the data are binomially
distributed. Due to the small values of $x_1$, $x_2$, we cannot apply Eqs. (7.7, 7.9)
nor use the Student’s density. As often happens, small non-Gaussian samples
are generally not analysable with statistical models of general validity.
Fortunately, in these cases there are Monte Carlo simulation methods, and
in fact we will come back to this problem in Chap. 8.

Nevertheless, we can evaluate preliminarily the results under Gaussian
approximation. We use our routine GdiffProp to perform the one-tailed
test. We obtain the following results:

GdiffProp(c(13,2),c(1631,817),alt="one") : p = 0.049,
GdiffProp(c(19,4),c(1631,817),alt="one") : p = 0.051,

indicating a p-value close to the $H_0$ rejection limit that, in this kind of studies,
is usually fixed around 5% or 1%. If we use the unpooled Eq. (7.7) instead of
Eq. (7.9), we obtain even smaller values:

GdiffProp(c(13,2),c(1631,817),pool=FALSE,alt="one") : p = 0.024,
GdiffProp(c(19,4),c(1631,817),pool=FALSE,alt="one") : p = 0.031.

We will reconsider the analysis of these data in Sect. 8.13.




7.3  Student’s t-Test 


From Sect. 6.11, we know that in the case of a Gaussian sample with N events
having mean m and standard deviation s, the value:

$$t = \frac{m - \mu}{s/\sqrt{N}} \tag{7.12}$$

is the occurrence of a Student’s variable with N − 1 degrees of freedom, if $\mu$ is
the true mean of the parent population. The test must then determine the p-value
of the Student’s quantile and judge whether or not it is appropriate to discard the
hypothesis. We notice that (7.12) is a valid test statistic for $N \ge 2$, since N = 2
is the smallest sample size that allows us to calculate s. The only possible test for
N = 1 is that of Eq. (7.5), which requires the a priori knowledge of $\mu$ and $\sigma$.

In addition to the case of Eq. (7.12), where a Gaussian sample mean is compared
to the population true mean value, the t-test is usually used also in two other
situations:


1. To verify the difference between the means of two independent samples with 
respect to an expected one 

2. To check the mean of the differences of two dependent or paired samples with 
respect to an expected mean difference 


Let us now look at the first case. 

Given two independent samples with means $m_1$, $m_2$, standard deviations $s_1$, $s_2$
and number of events $N_1$, $N_2$, respectively, the difference test considers the variable
t as:

$$t = \frac{m_1 - m_2 - \mu}{\sqrt{\dfrac{s_1^2}{N_1} + \dfrac{s_2^2}{N_2}}} \tag{7.13}$$

This equation is formally identical to Eq. (7.11), but in this case, to compute the
p-value corresponding to this Student’s quantile, it is essential to know the degrees
of freedom $\nu$, which is anything but trivial. The following steps explain how to
proceed, but they are not essential, and those who are not interested can immediately
go to Eq. (7.19).

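We know, from Eq. (6.72), that $(N - 1)s^2/\sigma^2$ follows a $\chi^2$ distribution with N − 1
degrees of freedom.

Analogously, if:

$$s_D^2 = \frac{s_1^2}{N_1} + \frac{s_2^2}{N_2} , \qquad \sigma_D^2 = \frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2} ,$$

we can affirm that $\nu s_D^2/\sigma_D^2$ is a $\chi^2$ variable with $\nu$ degrees of freedom. Since, from
Eq. (3.69), we know that $\mathrm{Var}[\chi^2_\nu] = 2\nu$, we can write:

$$\mathrm{Var}\left[\frac{\nu s_D^2}{\sigma_D^2}\right] = 2\nu = \frac{\nu^2}{\sigma_D^4}\,\mathrm{Var}[s_D^2] = \frac{\nu^2}{\sigma_D^4} \left[ \frac{\mathrm{Var}[s_1^2]}{N_1^2} + \frac{\mathrm{Var}[s_2^2]}{N_2^2} \right] \tag{7.14}$$

To find the variances of $s_1^2$ and $s_2^2$ we can use again Eq. (3.69):

$$\mathrm{Var}\left[\frac{(N - 1)s^2}{\sigma^2}\right] = 2(N - 1) \implies \mathrm{Var}[s^2] = \frac{2\sigma^4}{N - 1} , \tag{7.15}$$

so that:

$$\mathrm{Var}[s_D^2] = \frac{2\sigma_1^4}{N_1^2 (N_1 - 1)} + \frac{2\sigma_2^4}{N_2^2 (N_2 - 1)} \tag{7.16}$$

From the second and third term of Eqs. (7.14), we then have:

$$\frac{2}{\nu} = \frac{1}{\sigma_D^4} \left[ \frac{2\sigma_1^4}{N_1^2 (N_1 - 1)} + \frac{2\sigma_2^4}{N_2^2 (N_2 - 1)} \right] \tag{7.17}$$

and hence:

$$\nu = \frac{\sigma_D^4}{\dfrac{\sigma_1^4}{N_1^2 (N_1 - 1)} + \dfrac{\sigma_2^4}{N_2^2 (N_2 - 1)}} \tag{7.18}$$

Substituting the unknown quantities $\sigma_1$ and $\sigma_2$ with the measured ones, we obtain the
approximate Welch-Satterthwaite equation [Wel47], according to which the degrees
of freedom are given by the integer nearest to:

$$\nu = \frac{\left( \dfrac{s_1^2}{N_1} + \dfrac{s_2^2}{N_2} \right)^2}{\dfrac{s_1^4}{N_1^2 (N_1 - 1)} + \dfrac{s_2^4}{N_2^2 (N_2 - 1)}} \tag{7.19}$$

Equations (7.13, 7.19), known as Welch’s t-test, satisfactorily solve the comparison
between the means of two small Gaussian samples having different variances. These
formulae have to be used for $N_1, N_2 \ge 2$.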
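The following sketch evaluates Eqs. (7.13) and (7.19) by hand on two simulated Gaussian samples and compares the result with the Welch test of base R, t.test with var.equal=FALSE. The sample sizes, standard deviations and seed are arbitrary choices. One difference should be kept in mind: t.test works with the non-integer value of $\nu$, whereas the text rounds it to the nearest integer; the code keeps the non-integer value so that the two results can be compared directly.

```r
set.seed(2)                         # arbitrary seed
x <- rnorm(10, mean = 0, sd = 1)    # first sample,  N1 = 10
y <- rnorm(5,  mean = 0, sd = 3)    # second sample, N2 = 5

v1 <- var(x) / length(x)            # s1^2 / N1
v2 <- var(y) / length(y)            # s2^2 / N2

t  <- (mean(x) - mean(y)) / sqrt(v1 + v2)            # Eq. (7.13) with mu = 0
nu <- (v1 + v2)^2 /
      (v1^2 / (length(x) - 1) + v2^2 / (length(y) - 1))  # Eq. (7.19)
p  <- 2 * pt(-abs(t), df = nu)                       # two-tailed p-value

ref <- t.test(x, y, var.equal = FALSE)               # Welch test of base R
print(c(t = t, nu = nu, p.value = p))
print(c(ref$statistic, ref$parameter, p.value = ref$p.value))
```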

To conclude, we consider now the case where the two means come from samples
whose parent populations have the same mean and variance: $\sigma_1 = \sigma_2 = \sigma$. In this
situation, the standard variable of the difference becomes:

$$t = \frac{m_1 - m_2}{\sqrt{\dfrac{\sigma^2}{N_1} + \dfrac{\sigma^2}{N_2}}} = \frac{m_1 - m_2}{\sigma \sqrt{\dfrac{1}{N_1} + \dfrac{1}{N_2}}} . \tag{7.20}$$

Using Eq. (6.72), in which the variance is calculated with respect to the sample
mean, we define the non-reduced $\chi^2$:

$$\chi^2 = \frac{(N_1 - 1)s_1^2}{\sigma^2} + \frac{(N_2 - 1)s_2^2}{\sigma^2} = \frac{1}{\sigma^2} \left[ (N_1 - 1)s_1^2 + (N_2 - 1)s_2^2 \right] , \tag{7.21}$$

which, from Theorem 3.4, has $N_1 + N_2 - 2$ degrees of freedom. Based on the results
of Exercise 5.5, we can state that the variable (5.41), which now becomes:

$$t_s = \frac{m_1 - m_2}{\sigma \sqrt{\dfrac{1}{N_1} + \dfrac{1}{N_2}}} \cdot \frac{\sigma \sqrt{N_1 + N_2 - 2}}{\sqrt{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}} = \frac{m_1 - m_2}{s_{12} \sqrt{\dfrac{1}{N_1} + \dfrac{1}{N_2}}} , \qquad s_{12} = \sqrt{\frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{N_1 + N_2 - 2}} , \tag{7.22}$$

follows the Student’s density with $N_1 + N_2 - 2$ degrees of freedom. Notice that, if
in Eq. (7.18) one sets $\sigma_1 = \sigma_2$, the result $\nu = N_1 + N_2 - 2$ is in general not obtained,
because here the Student’s variables have a different structure. Therefore, as a practical rule,
Eqs. (7.13, 7.19) must be used when the variances can be different, whereas it is
better to use Eq. (7.22), which has the correct degrees of freedom, if it is a priori
known that the variances are equal.
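As a sketch of Eq. (7.22), the pooled statistic can be computed by hand and compared with the base R call t.test with var.equal=TRUE. The two samples below are simulated with a common mean and standard deviation; sizes, parameters and seed are arbitrary.

```r
set.seed(3)                        # arbitrary seed
x <- rnorm(8, mean = 5, sd = 2)    # first sample,  N1 = 8
y <- rnorm(6, mean = 5, sd = 2)    # second sample, N2 = 6
n1 <- length(x)
n2 <- length(y)

# pooled standard deviation s12 and statistic of Eq. (7.22)
s12 <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))
ts  <- (mean(x) - mean(y)) / (s12 * sqrt(1 / n1 + 1 / n2))
p   <- 2 * pt(-abs(ts), df = n1 + n2 - 2)     # two-tailed p-value

ref <- t.test(x, y, var.equal = TRUE)         # equal-variance t-test of base R
print(c(t = ts, df = n1 + n2 - 2, p.value = p))
print(c(ref$statistic, ref$parameter, p.value = ref$p.value))
```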

The p-value corresponding to the quantile (7.13, 7.19) or to the quantile (7.22)
for different and equal variances, respectively, is computed by our routine
TdiffMean(m,s,n,mu,alt,var), where m, s and n are two-element
vectors containing the mean, the standard deviation and the number of events of the
two samples, respectively. The mean default value mu = $\mu$ is 0. The one- or two-tailed
test type is selected through the variable alt="one" or alt="two", whereas
var=FALSE uses Eqs. (7.13, 7.19) and var=TRUE Eq. (7.22). As in the case of
GdiffMean, when m and s are single-valued parameters, mu is the true mean,
s is the sample standard deviation, and the test is performed according to
Eq. (7.12).

The R routine t.test(x,y,mu,alt,var) computes the p-values when
the raw data of two samples are contained in the vectors x, y. Here alt and var
are abbreviations, accepted by the partial argument matching of R, of the full
argument names alternative and var.equal; since the third positional argument
of t.test is alternative, mu should be passed by name. The variable
alt="two" (default value) computes the p-value for the two-tailed test. If the
value alt="less" or alt="greater" is set, the left- or right-tailed value is
respectively calculated. The variable var is the same as in TdiffMean. This
routine also includes Eq. (7.12) as a subcase when the y array is missing.


Exercise 7.5 
Verify Eqs. (7.13, 7.19, 7.22) using the R routine t.test. 


Answer We use here Monte Carlo methods that are described in the next
chapter. In R, the rnorm(n,m,s) routine generates a vector of n Gaussian
deviates with mean m and standard deviation s. Two such vectors are
passed to the routine t.test, and the variable p.value is read with the
command t.test(...)$p.value (see Appendix B). With the R routine
replicate, two samples of 10,000 simulated p-values are generated and
stored in the vectors tpst and tpsf, corresponding to the cases of equal and
different variances, respectively. From Theorem 7.1, if the method is correct,
the p-value must follow the uniform distribution. These two samples are then
passed to the R routine density which, using numerical methods, finds
the shape of the parent population. These densities are finally plotted in the
graphical window.

This procedure is included in our routine TpTest which is given in the 
following: 


TpTest <- function(m1=0, m2=0, s1=1, s2=3, n1=10, n2=5) {

  # prepare an empty graph (y range 0-2, around the uniform density d(p)=1)
  pts = seq(0., 1, length=100)    # points of the plot
  plot(pts, 2.0*pts, type='n', xlab='p-values', ylab='d(p)')

  # generate p-values according to the two cases of variances;
  # mu and var.equal are passed by name, because the third
  # positional argument of t.test is the alternative, not mu
  tpst = replicate(10000, t.test(rnorm(n1,mean=m1,sd=s1),
            rnorm(n2,mean=m2,sd=s2), mu=m1-m2, var.equal=TRUE)$p.value)

  tpsf = replicate(10000, t.test(rnorm(n1,mean=m1,sd=s1),
            rnorm(n2,mean=m2,sd=s2), mu=m1-m2, var.equal=FALSE)$p.value)

  # add the plots of the p-value densities to the initial one
  lines(density(tpsf), type='l', lwd=2)                            # Welch
  lines(density(tpst), col="red", type='l', lwd=2, lty='dashed')   # Eq. (7.22)

}


Using this routine, and systematically changing the arguments (which we
invite you also to do), we have obtained that Welch quantiles were correct
for all considered cases, that is, $N_1, N_2 \ge 2$, both for different and equal
variances. For small samples ($N_1, N_2 < 5$) and in the case of equal variances,
the distribution obtained with the t-test of Eq. (7.22) better follows the
uniform density. Surprisingly enough, the quantile (7.22), which assumes
equal variances, is reliable even with different variances when $N_1 = N_2$,
while it fails when $N_1 \ne N_2$, as shown in Fig. 7.2.


[Figure: estimated densities d(p) of the simulated p-values, plotted against the p-value]

Fig. 7.2 Distribution of 10,000 p-values for the t-test between two samples of different size ($n_1 = 10$,
$n_2 = 5$) coming from populations with mean $\mu = 0$ and different variances ($\sigma_1^2 = 1$, $\sigma_2^2 = 3$).
The continuous line represents the Welch method using the quantile from Eqs. (7.13, 7.19); the
dashed curve represents the p-value corresponding to the quantile (7.22). The plot shows that, in
this case, the appropriate distribution is given by the Welch’s method


We conclude this section with the use of t-test for paired samples. This procedure
can be applied when all the data of the two original samples $x_i$ and $x'_i$, having the
same number of events i = 1, 2, ..., N, are available. Often this happens when
two treatments have been carried out on the same group of objects or individuals.
For example, we might have different blood test samples from people who were
first given a placebo and then one or more (different) drugs. In this situation, the
paired data test analyses the difference in response on a person-by-person basis,
thus minimizing the effects due to discrepancies between individuals. With this
procedure, a new sample of the differences ($d_i = x_i - x'_i$, i = 1, 2, ..., N) is
created, with mean and standard deviation $m_d$ and $s_d$, respectively.

In general, the null hypothesis to be falsified is that of an ineffective treatment.
This implies the true mean of the paired difference sample to be null: $\{H_0 : \mu_d = 0\}$.
It is therefore necessary to check whether the variable:

$$t_d = \frac{m_d}{s_d/\sqrt{N}} \tag{7.23}$$

follows the Student’s distribution with N − 1 degrees of freedom. The smaller the
p-value corresponding to $t_d$, the smaller the probability of making a mistake by
discarding $H_0$ when it is true. When $H_0$ predicts true means $\mu_1 \ne \mu_2$, defining
$\mu_d = \mu_1 - \mu_2$ the $t_d$ variable becomes:

$$t_d = \frac{m_d - \mu_d}{s_d/\sqrt{N}} \tag{7.24}$$

and the test procedure remains unchanged.


In R, the t-test for paired data, when the vectors x and y have the same
dimension, can be performed with the call t.test(x, y, paired=TRUE).
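The equivalence between this call and Eq. (7.23) can be checked directly. The six pairs of values below are invented for illustration and do not come from the text.

```r
# invented measurements on the same 6 subjects (illustration only)
x <- c(12.1, 11.4, 13.0, 12.7, 11.9, 12.5)   # first treatment
y <- c(11.6, 11.5, 12.2, 12.1, 11.8, 11.9)   # second treatment

d  <- x - y                           # sample of the paired differences
N  <- length(d)
td <- mean(d) / (sd(d) / sqrt(N))     # Eq. (7.23)
p  <- 2 * pt(-abs(td), df = N - 1)    # two-tailed p-value, N - 1 d.o.f.

ref <- t.test(x, y, paired = TRUE)    # paired test of base R
print(c(t = td, df = N - 1, p.value = p))
print(c(ref$statistic, ref$parameter, p.value = ref$p.value))
```

The paired test is therefore nothing but the one-sample test of Eq. (7.12) applied to the differences, with true mean equal to zero.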


7.4 Chi-Square Test 


With both the z and t-tests we can only check the compatibility of a single or a pair
of sample values with a parameter of the parent population given under $H_0$. Using
the $\chi^2$-test or chi-square test, we can remove this limitation.

The test is based on Pearson’s Theorem 3.3, which states that the sum of
squares of $\nu$ independent Gaussian standard variables follows the $\chi^2$ distribution
with $\nu$ degrees of freedom. In this case, the left-tail and right-tail quantiles used to
calculate the significance levels, unlike the Gaussian and Student quantiles, assume
different values, due to the asymmetry of the $\chi^2$ distribution. Here, we will also use the
notation of Sect. 3.8 and denote with Q a variable following the $\chi^2$ distribution and
with the symbol $\chi^2$ (the same of the density) the numerical occurrences of Q. If
this variable is divided by the degrees of freedom, the resulting reduced chi-square
variable is denoted with $Q_R$ and $\chi^2_R$.

Therefore, the test is based on the value:

$$\chi^2 = \sum_{i=1}^{n} \frac{(x_i - \mu_i)^2}{\sigma_i^2} , \tag{7.25}$$

where $x_i$ are variates coming from Gaussian densities with means $\mu_i$ and variances
$\sigma_i^2$. The rationale of the test is that the variable (7.25) is distributed according to $\chi^2$
only if $\mu_i$ and $\sigma_i$ are the correct values, i.e. those assumed under $H_0$. The $\chi^2$-test
is very flexible and adapts to many different situations. For example, the expected
value $\mu_i$ may be a theoretical model $\mu_i = f(x_i)$, the expected content of the number
of events in a bin of a histogram of central value $x_i$ (as we will see in the next
section), the mean $\mu_i = \mu$ of a statistical population from which the values $x_i$ were
obtained, and so on.
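A minimal sketch of Eq. (7.25) in R follows. The five measured values, their expected values under $H_0$ and the common standard deviation are invented for illustration; the right-tail p-value is obtained with the base R routine pchisq.

```r
# invented data: measured values, expected values under H0, known std deviations
x     <- c(10.2, 11.9, 14.3, 15.8, 18.4)
mu    <- c(10, 12, 14, 16, 18)
sigma <- rep(0.3, 5)

chi2    <- sum((x - mu)^2 / sigma^2)      # Eq. (7.25)
nu      <- length(x)                      # no parameter estimated from the data
p_right <- 1 - pchisq(chi2, df = nu)      # right-tail p-value
print(c(chi2 = chi2, dof = nu, p.value = p_right))
```

Here the number of degrees of freedom equals the number of terms because all the $\mu_i$ and $\sigma_i$ are fixed by the hypothesis; it would decrease if some of them were estimated from the same data.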

An important case where the x7-test applies is that of N Poissonian counts n;, 
which, as we know, can be considered as Gaussians for n; > 10. In this situation, 
a? = ;, and the test variable becomes: 


N oct pes. 
pay =. (7.26) 


Sometimes, when n; < 10, the continuity correction, also known as Yates correction, 
is used: 


N 
; — pi — 0.5)? 
r=) Se SS (7.27) 
j=1 Mi 


7.4 Chi-Square Test 275 


Frequently, the value n; is used in the denominator, thus obtaining the modified x: 
N 2 
-ypa" (nj — “Hi (7.28) 
i=1 


Equation (7.28) is usually applied when $n_i > 30$. This condition guarantees that this
variable is still $\chi^2$ distributed when $\mu_i$ is the true expected value. The test is also
valid when there is only a single Poissonian sample having $\mu_i = \mu$:

$$\chi^2 = \sum_{i=1}^{N} \frac{(n_i - \mu)^2}{\mu} . \tag{7.29}$$


A further simplification of the test occurs when the true value $\mu$, which may be
unknown, is replaced by the value of the sample mean $m$:

$$\chi^2 = \sum_{i=1}^{N} \frac{(n_i - m)^2}{m} . \tag{7.30}$$


In this case, the test only verifies that the sample under examination is Poissonian,
without making further assumptions about its true mean. Note that the use of $m$
creates a linear relation between the data and therefore, according to Theorem 4.4,
the variable (7.30) has $N - 1$ degrees of freedom. The study of this variable shows
that it tends to follow the $\chi^2$ density for samples with number of events $N > 30$, as
also shown in Exercise 7.6 given below.
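
The variants (7.27)-(7.30) can be compared side by side on one simulated Poissonian sample. The sketch below is only illustrative: the true mean, the sample size and the seed are arbitrary assumptions, chosen so that no count is zero in the denominator of Eq. (7.28).

```r
# Sketch comparing the Poisson test variables (7.27)-(7.30) on one simulated sample.
set.seed(1)                      # arbitrary seed, for reproducibility only
mu <- 40                         # assumed true mean
n  <- rpois(50, mu)              # N = 50 Poissonian counts

chi2_true  <- sum((n - mu)^2 / mu)               # Eq. (7.29), N degrees of freedom
chi2_yates <- sum((abs(n - mu) - 0.5)^2 / mu)    # Eq. (7.27), continuity correction
chi2_mod   <- sum((n - mu)^2 / n)                # Eq. (7.28), measured counts in the denominator
m          <- mean(n)
chi2_mean  <- sum((n - m)^2 / m)                 # Eq. (7.30), N - 1 degrees of freedom

p_true <- 1 - pchisq(chi2_true, length(n))       # right-tail p-values
p_mean <- 1 - pchisq(chi2_mean, length(n) - 1)
```

Note that the variable (7.30) coincides with $(N-1)\,s^2/m$, where $s^2$ is the sample variance returned by var(n): the test compares the observed dispersion with the one expected for a Poissonian sample.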


Exercise 7.6 
Verify Eq. (7.30) with simulated data. 


Answer We use the same method of Exercise 7.5. With the routine
Chi2Testm (presented below), a set of Poissonian counts is generated
with rpois, and then the p-values pchi and pchif, associated to the $\chi^2$
variables chi and chif of Eqs. (7.29) and (7.30), are evaluated with pchisq.
Since the true lambda and the experimental values mexp are used for the
means, the estimators are correct if the Gaussian approximation holds. Then,
from Theorem 7.1, the p-values are uniformly distributed.

The simulated distributions are obtained with the R routine density, 
that evaluates the p.d.f. from raw data with smoothing algorithms tuned by 
the parameter adj. With adj =0.5, a moderate smoothing is obtained that 
allows us to clearly recognize, in Fig. 7.3, the uniform shape of the obtained 
distributions. 




The results show that, for $N = 10$, Eqs. (7.29) and (7.30) follow the $\chi^2$
density. Figure 7.3 clearly demonstrates the importance of assigning the correct
number of degrees of freedom in Eq. (7.30), when the sample mean is used
and the variables are thus correlated. We suggest that you vary the number $N = n$
of generated variables and check when the p-value distribution differs from
the uniform density.


Chi2Testm <- function(n=10, Nsim=1000, denom=FALSE, adj=0.5, lambda=10) {
  chi   <- seq(0, 0, length=Nsim)   # clear vectors
  chif  <- seq(0, 0, length=Nsim)
  pchi  <- seq(0, 0, length=Nsim)
  pchif <- seq(0, 0, length=Nsim)
  # prepare graph
  pts = seq(0., 1, length=100)      # points of the plot
  plot(pts, 1.5*pts, type='n', xlab='p-values', ylab='d(p)')
  # simulate Nsim p-values
  for (i in 1:Nsim) {
    z <- rpois(n, lambda)           # n Poisson data of mean lambda
    mexp = mean(z)
    sigma2 = lambda
    if (denom == FALSE) sigma2 = mexp   # default: sample mean in the denominator
    for (j in 1:length(z)) {
      chi[i]  = chi[i]  + (z[j]-lambda)*(z[j]-lambda)/lambda   # Eq. (7.29)
      chif[i] = chif[i] + (z[j]-mexp)*(z[j]-mexp)/sigma2       # Eq. (7.30)
    }
    pchi[i]  = pchisq(chi[i], n)      # chi2 p-values
    pchif[i] = pchisq(chif[i], n-1)   # n-1 degrees of freedom
  }
  # final plots
  lines(density(pchi, adjust=adj), type='l', col='red', lwd=2)
  lines(density(pchif, adjust=adj), type='l', lwd=2, lty='dashed')
}


Fig. 7.3 To the left: p-value distribution from a sample with $N = 10$ Poissonian counts with the
test variables (7.29) (full line) and (7.30) (dashed line). To the right: the same result is shown when
in Eq. (7.30) $N$ degrees of freedom are wrongly used instead of $N - 1$. In this case, the obtained
p-value distribution (dashed line) differs from the uniform density (horizontal axis: p-values; vertical axis: d(p))
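
The effect of the wrong number of degrees of freedom can also be checked numerically, without plots. The following vectorised sketch is an alternative to the loop of Chi2Testm, not the routine of the text; the variable names, the seed and the number of simulations are assumptions. It relies on the fact that a uniform sample of p-values has mean 0.5.

```r
# Vectorised sketch of the check of Exercise 7.6 (names are illustrative).
set.seed(123)                                          # arbitrary seed
n <- 10; Nsim <- 2000; lambda <- 10
z    <- matrix(rpois(n * Nsim, lambda), nrow = Nsim)   # one simulated sample per row
m    <- rowMeans(z)                                    # sample mean of each sample
chif <- rowSums((z - m)^2) / m                         # Eq. (7.30) for each sample
p_ok    <- pchisq(chif, n - 1)                         # correct degrees of freedom
p_wrong <- pchisq(chif, n)                             # wrong degrees of freedom
c(mean(p_ok), mean(p_wrong))                           # a uniform sample has mean 0.5
```

With $N$ degrees of freedom the cumulative values are systematically smaller, so their distribution is shifted towards zero, as in the right panel of Fig. 7.3.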


7.5 Compatibility Check Between Sample and Population 


In this section we complete the study of the sample (6.97), already analysed in
Sect. 6.14, by using the $\chi^2$ test. The mean and variance calculated from data are:

$$m = \frac{1}{1000} \sum_i x_i n_i = 70.09 , \qquad s^2 = \frac{1}{999} \sum_i (x_i - m)^2 n_i = 95.40 .$$


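The $1\sigma$ confidence intervals for mean, variance and standard deviation, calculated
from Eqs. (6.53, 6.68), are:

$$\mu \in m \pm \frac{s}{\sqrt{N}} = 70.1 \pm 0.3 ,$$

$$\sigma^2 \in s^2 \pm s^2 \sqrt{\frac{2}{N-1}} = 95.4 \pm 4.3 ,$$

$$\sigma \in s \pm \frac{s}{\sqrt{2(N-1)}} = 9.77 \pm 0.22 .$$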


Since, for $N = 1000$, Gaussian confidence levels can be applied, we can say that
these results are compatible with the true values $\mu = 70$ and $\sigma = 10$.

Now let us check if the sample comes from a Gaussian density, assuming that the
mean and standard deviation values are the true ones, $\mu = 70$, $\sigma = 10$.

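The expected (true) numbers of events per bin are then given by Eq. (6.98), where
$p(x)$ is the Gaussian, $N = 1000$ and $\Delta x = 5$:

$$\mu_i = N p_i = N \int_{\Delta x} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] dx \simeq 1000 \cdot 5 \cdot \frac{1}{10\sqrt{2\pi}} \exp\left[-\frac{(x_i - 70)^2}{200}\right] . \tag{7.31}$$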
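
The expected bin contents of Eq. (7.31) can be sketched in R as follows. The grid of 13 bin midpoints from 37.5 to 97.5 is assumed from Table 7.1 and from the bin width $\Delta x = 5$; the variable names are illustrative. The exact integral over each bin, obtained with pnorm, is shown next to the midpoint approximation used in Eq. (7.31).

```r
# Sketch of Eq. (7.31): expected bin contents for N = 1000, mu = 70, sigma = 10, bin width 5.
N <- 1000; mu <- 70; sigma <- 10; dx <- 5
x_mid <- seq(37.5, 97.5, by = dx)                    # assumed bin midpoints (13 bins)
mu_approx <- N * dx * dnorm(x_mid, mu, sigma)        # midpoint approximation of Eq. (7.31)
mu_exact  <- N * (pnorm(x_mid + dx/2, mu, sigma) -
                  pnorm(x_mid - dx/2, mu, sigma))    # exact integral over each bin
round(cbind(x_mid, mu_approx, mu_exact), 1)
```

For the bin centred at 72.5, the midpoint approximation gives the value 193.3 reported in column 3 of Table 7.1.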


In columns (1)-(7) of Table 7.1, we have reported the following quantities: the
spectrum value in the bin midpoint (1), the observed number of events (2),
the expected number of events from Eq. (7.31) (3), the statistical error $s_i$ from
Eq. (6.102) (4), the standard deviation:

$$\sigma_i = \sqrt{\mu_i \left(1 - \frac{\mu_i}{N}\right)}$$

of the number of events (5), the values of the standard variable:

$$t_i = \frac{|n_i - \mu_i|}{s_i} ,$$

related to the observed number of events in each bin and calculated with the
statistical error (6), and finally the values:

$$t_i = \frac{|n_i - \mu_i|}{\sigma_i} ,$$

calculated with the true standard deviation (7). In Fig. 7.4 the histogram of Fig. 6.11
is shown (columns 1, 2 and 5 of Table 7.1) with the addition of the continuous
line that represents the expected (true) values coming from the Gaussian (7.31) and
computed in column 3 of Table 7.1.

Table 7.1 The data of the histogram of Fig. 6.11 are compared to the values of the Gaussian density (7.31)

(1) $x_i$ | (2) $n_i$ | (3) $\mu_i$ | (4) $s_i$ | (5) $\sigma_i$ | (6) $|n_i-\mu_i|/s_i$ | (7) $|n_i-\mu_i|/\sigma_i$
37.5 |   1 |   1.0 |  1.0 |  1.0 | 0.00 | 0.00
...
62.5 | 152 | 150.6 | 11.3 | 11.3 | 0.12 | 0.12
67.5 | 186 | 193.3 | 12.3 | 12.5 | 0.59 | 0.58
72.5 | 207 | 193.3 | 12.8 | 12.5 | 1.10 | 1.10
77.5 | 153 | 150.6 | 11.4 | 11.3 | 0.20 | 0.21
82.5 | 101 |  91.3 |  9.5 |  9.1 | 1.02 | 1.06
87.5 |  42 |  43.1 |  6.3 |  6.4 | 0.17 | 0.17
92.5 |   7 |  15.8 |  2.6 |  3.9 | 3.38 | 2.26
97.5 |   6 |   4.5 |  2.4 |  2.1 | 0.90 | 0.71

A first evaluation of the agreement between data and model can be made by eye:
approximately 68% of the experimental points should "touch", within an error bar,
the corresponding true value. In our case, from column 7 of Table 7.1 and from
Fig. 7.4, we find that, out of a total of 13 points, 9 points (69%) agree with the expected
values within $\pm\sigma_i$, 12 points (92%) within $\pm 2\sigma_i$ and 13
points (100%) within $\pm 3\sigma_i$.

As you can see, the fluctuations seem to agree very well with the percentages
given by the $3\sigma$ law (3.35), indicating that the density assumed as a model gives a
correct representation of the data.

We reach the same conclusions also using the statistical errors, that is, the
standard variables of column 6. The only anomalous channel is the one corresponding
to the value $x_i = 92.5$, which in this case provides an estimated standard value of
3.38, compared to a correct value of 2.26. The disagreement originates from the low
content of the channel, which has only seven events. For this reason, if statistical
errors are used, discrepancies greater than $3\sigma$ are generally accepted in channels
having fewer than ten events.


Fig. 7.4 Comparison between the histogram of Fig. 6.11 and the expected
values from a Gaussian with $\mu = 70$ and $\sigma = 10$


We now perform the $\chi^2$-test (7.26), assuming as null hypothesis the true bin
probabilities $p_i$ of Eq. (7.31):

$$\chi^2 = \sum_{i=1}^{K} \frac{(n_i - \mu_i)^2}{\mu_i} = \sum_{i=1}^{K} \frac{(n_i - N p_i)^2}{N p_i} . \tag{7.32}$$


Equation (7.32) is approximately the sum of squares of $K$ independent standard
Gaussian variables when the total number of events $N$ is variable, and the sum is
made on bins with more than 5 to 10 events. Since, for a Poisson distribution, $\sigma_i^2 = \mu_i$,
from Pearson's Theorem 3.3, we get that this sum follows the $\chi^2$ density (3.66),
with $K$ degrees of freedom. The integral values of the reduced $\chi^2$ density (3.72) are
reported in Table E.3.

If, on the other hand, the total number $N$ of events is constant, the variables of
the sum (7.32) are correlated; Pearson's Theorem 4.6 nevertheless assures us that the
result is a $\chi^2$ variable, this time with $(K - 1)$ degrees of freedom.

Note that, when $N$ is constant, it is wrong to write:

$$\chi^2 = \sum_{i=1}^{K} \frac{(n_i - N p_i)^2}{N p_i (1 - p_i)} \qquad \text{(wrong!)} ,$$

because this is the sum of squares of correlated variables.

In conclusion, it is always necessary to sum the squared differences between
the observed frequencies and the true ones, each divided by the true frequency, taking
care to remember that the degrees of freedom are equal to the number of channels if
$N$ is a Poissonian variable, while they must be decreased by one unit if $N$ is constant.
All these rules derive from Pearson's Theorem 4.6 applied to the statistical analysis
of histograms.
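
The rule on the degrees of freedom can be illustrated with a short simulation, given here as a sketch under stated assumptions: $K = 8$ equiprobable bins, an expected total of 2000 events, and arbitrary seed and number of repetitions. Since a $\chi^2$ variable with $\nu$ degrees of freedom has mean $\nu$, the average of the variable (7.32) should be close to $K$ when the total fluctuates (Poissonian bins) and close to $K - 1$ when the total is fixed (multinomial bins).

```r
# Simulation sketch: degrees of freedom of Eq. (7.32) for Poissonian vs fixed N.
set.seed(42)                                              # arbitrary seed
K <- 8; N <- 2000; nsim <- 4000
p  <- rep(1 / K, K)                                       # assumed true bin probabilities
mu <- N * p                                               # expected bin contents
chi2 <- function(n) sum((n - mu)^2 / mu)                  # Eq. (7.32)
chi2_pois  <- replicate(nsim, chi2(rpois(K, mu)))         # total number of events fluctuates
chi2_fixed <- replicate(nsim, chi2(rmultinom(1, N, p)))   # total number of events fixed to N
c(mean(chi2_pois), mean(chi2_fixed))                      # expected close to K and K - 1
```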

Often, with this test, one tries only to identify a deviation towards large values,
i.e. towards the right tail of the expected distribution, when the null hypothesis is
assumed to be true. In this case, the p-value is equal to:

$$SL = P\{Q \ge \chi^2(\nu)\} . \tag{7.33}$$

However, as we will discuss below, too small $\chi^2$ values, where the model fits the
data very well, are often suspect. In this case it is advisable to perform a two-tailed
test (see Eq. 7.1), as shown in Fig. 7.5, doubling the smaller area to the right or left
of the quantile value $\chi^2$ or $\chi^2_R$:

$$\begin{aligned} SL &= 2P\{Q(\nu) > \chi^2(\nu)\} \quad \text{if } P\{Q(\nu) > \chi^2(\nu)\} < 0.5 , \\ SL &= 2P\{Q(\nu) < \chi^2(\nu)\} \quad \text{if } P\{Q(\nu) > \chi^2(\nu)\} > 0.5 . \end{aligned} \tag{7.34}$$


We now perform the $\chi^2$-test on the data of the histogram (6.97). Since the first bin
only contains one event, we group the first two bins and sum over the 11 remaining
ones. The value of the reduced $\chi^2$ obtained from the data of Table 7.1 is given by:

$$\chi^2_R(11) = \frac{1}{11} \left[ \frac{(n_1 + n_2 - \mu_1 - \mu_2)^2}{\mu_1 + \mu_2} + \sum_{i=3}^{13} \frac{(n_i - \mu_i)^2}{\mu_i} \right] = \frac{8.987}{11} = 0.82 . \tag{7.35}$$

Since $N$ is fixed, the number of degrees of freedom is the number of elements
in the sum minus one, that is 11.

We can then state that, with the calculated $\chi^2$ value, we have obtained the most
comprehensive synthesis, because the results of Table 7.1 and Fig. 7.4, needed
to compare data and model, are squeezed into a single value, in our case 0.82.
Obviously, if we repeated the experiment, we would get a different result, because
the $\chi^2_R$ of Eq. (7.35) is the value assumed by a random variable of density given in
Table E.3.

We now finally proceed to the $\chi^2$ test. In Table E.3, in the row corresponding to
11 degrees of freedom, we search for the area corresponding to the value of 0.82:
we find a value of about 60%. From Eq. (7.34), this area corresponds to a p-value for
a two-tailed test of:

$$SL = 2P\{Q_R < 0.82\} \simeq 2(1 - 0.60) = 0.80 .$$


Fig. 7.5 Observed significance level for a two-tailed test (shaded areas) when the experimental $\chi^2_R$
value corresponds to a) $P\{Q_R > \chi^2_R\} < 0.5$ probability or b) $P\{Q_R > \chi^2_R\} > 0.5$. The
shaded areas are equal

In R, the same result is obtained by requiring the cumulative value with the
command 2*pchisq(8.987,11)=0.754. If the model holds, $\chi^2_R$ values
smaller than this one are possible in about 40% of experiments. Since rejecting the
hypothesis at an observed significance level of about 80% would make the type I error
highly probable, the hypothesis must be
accepted. In conclusion, we generated an artificial sample of 1000 events from a
Gaussian distribution with $\mu = 70$ and $\sigma = 10$. Next, we performed statistical tests
with respect to the true density. These tests showed good agreement between data
and model.
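
The one-tailed and two-tailed prescriptions of Eqs. (7.33) and (7.34) can be wrapped in two small R functions. This is only a sketch: the function names are illustrative and do not belong to the routines of the text.

```r
# Sketch of Eqs. (7.33)-(7.34); the function names are illustrative.
chi2_p_right <- function(chi2, nu) 1 - pchisq(chi2, nu)    # Eq. (7.33), right tail only
chi2_p_two   <- function(chi2, nu) {                       # Eq. (7.34), two tails
  right <- 1 - pchisq(chi2, nu)
  2 * min(right, 1 - right)                                # double the smaller tail area
}
chi2_p_two(8.987, 11)    # non-reduced chi-square of Eq. (7.35), 11 degrees of freedom
```

For the value 8.987 with 11 degrees of freedom the left tail is the smaller one, so the function reduces to the call 2*pchisq(8.987,11) used above.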

As you can see, the logic of the $\chi^2$-test is exactly the same as that used in
the previous tests. The differences involve only the variable type and the test
function. The only point to be careful about is that the $\chi^2$ density, unlike the
Gaussian, is not symmetric. It is therefore necessary to take into account both
quantile areas $\chi^2_{\alpha/2}$ and $\chi^2_{1-\alpha/2}$. Too high or too low $\chi^2$ values require further
considerations: in both cases the test indicates that the fluctuations of the data
around the values assumed to be true are not purely statistical. When

$$P\{Q_R > \chi^2_R(\nu)\} < 0.01 ,$$

the reduced $\chi^2$ value is in the right tail of its distribution curve and is too high. In this
situation, the inadequacy of the parent population assumed as the model is highly
probable. It may also be that the errors assigned to the data have been miscalculated
and are underestimated. If errors have been correctly evaluated, the result of the test
is the rejection of the hypothesis. Much more rarely, it could happen that the $\chi^2_R$
value is too small:

$$P\{Q_R < \chi^2_R(\nu)\} < 0.01 .$$

This may be the case when the a priori probabilities $p_i$ are evaluated from a density
which tends to interpolate the data due to an excessive number of parameters, or
when experimental errors have been erroneously overestimated. We will learn more
about these concepts in Chap. 11.

Often the $\chi^2$ of histogram data is calculated
by dividing by the measured frequencies instead of the true ones, thus applying
Eq. (7.28):

$$\chi^2 = \sum_{i=1}^{K} \frac{(n_i - N p_i)^2}{n_i} . \tag{7.36}$$




Table 7.2 Histogram of the experiment consisting of $N = 100$ trials, each of them made of
ten coin tosses (see also Table 2.2). The columns contain the possible number of heads (1); the
number of times (successes) in which the number of heads reported in the first column has been
obtained (2); the mathematical expectation, or true number of events, given by the total number
of trials times the binomial probability of Eq. (2.29) with $n = 10$, $p = 1/2$ (3); the histogram
statistical errors calculated with the expected number of events $\sigma_i = \sqrt{\mu_i (1 - \mu_i/N)}$ (4) and with
the measured one $s_i = \sqrt{n_i (1 - n_i/N)}$ (5); the $\chi^2$ values for each bin, obtained using the true
probability (6); and the measured frequency (7)

Spectrum (n. of heads) $x_i$ | Successes $n_i$ | Binomial $\mu_i$ | Std. dev. "(true)" $\sigma_i$ | Std. dev. (estimated) $s_i$ | $\chi^2$ "(true)" $(n_i-\mu_i)^2/\mu_i$ | $\chi^2$ (estimated) $(n_i-\mu_i)^2/n_i$
 0 |  0 |  0.1 | 0.3 | 0.0 | 0.10 | -
 1 |  0 |  1.0 | 1.0 | 0.0 | 1.00 | -
 2 |  5 |  4.4 | 2.0 | 2.2 | 0.08 | 0.07
 3 | 13 | 11.7 | 3.2 | 3.4 | 0.14 | 0.13
 4 | 12 | 20.5 | 4.0 | 3.2 | 3.52 | 6.02
 5 | 25 | 24.6 | 4.3 | 4.3 | 0.01 | 0.01
 6 | 24 | 20.5 | 4.0 | 4.3 | 0.60 | 0.51
 7 | 14 | 11.7 | 3.2 | 3.5 | 0.45 | 0.38
 8 |  6 |  4.4 | 2.0 | 2.4 | 0.58 | 0.43
 9 |  1 |  1.0 | 1.0 | 1.0 | 0.00 | 0.00
10 |  0 |  0.1 | 0.3 | 0.0 | 0.10 | -

In this case, the denominator is approximated, even if model independent. Therefore, 
the division by the frequencies expected from the model is more correct and 
consistent. However, if only channels with more than five to ten events are taken 
into account in the hypothesis test, the use of the measured frequencies almost 
always leads to equivalent results. In tests with minimization procedures, which
we will describe in Chaps. 10 and 11, the measured frequencies are often used in
the denominator. This choice greatly simplifies this type of algorithm since the
denominator, being model independent, remains constant during the process of
model adjustment to the data.


Exercise 7.7 

Analyse the 10 coin experiment of Table 2.2 and Fig. 2.1, assuming independent
trials and that all the coins have an a priori probability head/tail of
1/2.



Answer The data are shown again in the new Table 7.2, where some new 
values useful for the analysis have been computed. We define the input 
parameters: 


- Number of tosses per trial: $n = 10$
- Number of trials: $N = 100$
- Total number of tosses: $N \cdot n = M = 1000$


The first test is to check the total number of successes, that is, the total number
of heads. From the first two columns of the table, we obtain:

$$x = 2 \cdot 5 + 3 \cdot 13 + \cdots = 521 \ \text{successes} .$$

Since, under $H_0$, the expectation value is $Mp = 500$ and the standard
deviation, from Eq. (3.5), is $\sigma = \sqrt{500 (1 - 1/2)} = 15.8$, we obtain the
standard value:

$$t = \frac{x - Mp}{\sqrt{Mp(1-p)}} = \frac{521 - 500}{15.8} = 1.33 ,$$

corresponding, in Table E.1, to a p-value:

$$P\{|T| \ge 1.33\} = 2 \cdot (0.5000 - 0.4082) = 0.1836 \simeq 18.4\% .$$


Therefore, we can affirm that, in repeated experiments where 1000 well-balanced
coins are flipped independently, in about 18% of cases one observes
deviations from the expected value (500), in either direction, of more
than 21 units.

In Exercise 2.6 we computed the mean and variance from the histogram of
Table 7.2:

$$m = 5.21 , \quad s^2 = 2.48 , \quad s = 1.57 ,$$

and the corresponding expected values from a binomial density with parameters
$n = 10$ and $p = 1/2$:

$$\mu = np = 5.00 , \quad \sigma^2 = np(1-p) = 2.50 , \quad \sigma = 1.58 .$$


The test on the mean gives, using Eq. (6.50):

$$t = \frac{m - \mu}{s/\sqrt{N}} = \frac{5.21 - 5.00}{0.157} = 1.34 .$$


This result agrees with the one obtained in the previous test on the total number
of successes, because the identity:

$$\frac{Mp - x}{\sqrt{Mp(1-p)}} = \frac{Nnp - N\,x/N}{\sqrt{N^2\,\dfrac{np(1-p)}{N}}} = \frac{np - x/N}{\sqrt{np(1-p)}/\sqrt{N}} = \frac{\mu - m}{\sigma/\sqrt{N}}$$

holds. For the test on the variance, we can use the large sample formula (6.68)
and compute the standard value:

$$t = \frac{2.50 - 2.48}{2.5 \sqrt{\dfrac{2}{N-1}}} = \frac{2.50 - 2.48}{0.35} = 0.06 ,$$

which gives from Table E.1 a p-value:

$$P\{|T| \ge 0.06\} \simeq 0.95 = 95\% .$$


In the end, we arrive at the $\chi^2$ test, which is the final test on the overall sample
shape. By grouping the first three and last three channels, so as to always have
a number of events per bin $\ge 5$, we obtain:

$$\chi^2_R(6) = \frac{1}{6} \left[ \frac{(5 - 5.4)^2}{5.4} + \frac{(13 - 11.7)^2}{11.7} + \cdots + \frac{(7 - 5.4)^2}{5.4} \right] = \frac{5.18}{6} = 0.86 .$$

The number of degrees of freedom is 6, because the total number of events is
fixed and there are 7 terms in the sum. Using Table E.3, we determine that the
significance level corresponding to this $\chi^2_R$ value is:

$$1 - P\{Q_R > 0.86\} \simeq 0.48 .$$

As a matter of fact, a value $\{Q_R = \chi^2_R\}$ near the most probable one has been
obtained. Also the call pchisq(5.18,6) gives a p-value = 0.48.

If, in the $\chi^2$ calculation, we had divided by the measured frequencies, as in
Eq. (7.36), we would have obtained $\chi^2_R = 7.42/6 = 1.24$ and a significance
level of about 28%. The two results, although equivalent from a statistical point
of view, differ numerically due to the rather small sample size ($N = 100$).

Very high significance levels were obtained in all the previous tests. This 
demonstrates a good statistical agreement between the data and the model 
assuming independent tosses of fair two-sided coins. 

The experimental data with their statistical errors, and the theoretical 
values given by the binomial distribution, are also shown in Fig. 7.6. 
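
The numbers of this exercise can be reproduced with a few lines of R. The sketch below is not part of the routines of the text: the variable names are illustrative, and the observed counts are read from Table 7.2 (they reproduce the totals $N = 100$ and $x = 521$ quoted above). The expected counts are computed with dbinom without rounding.

```r
# Sketch reproducing the tests of Exercise 7.7 (illustrative names).
heads <- 0:10
n_obs <- c(0, 0, 5, 13, 12, 25, 24, 14, 6, 1, 0)    # observed counts, read from Table 7.2
N <- sum(n_obs); n <- 10; p <- 0.5

x     <- sum(heads * n_obs)                          # total number of heads
t_tot <- (x - N*n*p) / sqrt(N*n*p*(1 - p))           # standard value of the total
p_tot <- 2 * pnorm(-abs(t_tot))                      # two-tailed Gaussian p-value

mu   <- N * dbinom(heads, n, p)                      # expected counts (not rounded)
grp  <- c(1, 1, 1, 2, 3, 4, 5, 6, 7, 7, 7)           # group bins 0-2 and 8-10
n_g  <- tapply(n_obs, grp, sum)
mu_g <- tapply(mu, grp, sum)
chi2_true <- sum((n_g - mu_g)^2 / mu_g)              # expected counts in the denominator
chi2_est  <- sum((n_g - mu_g)^2 / n_g)               # measured counts, Eq. (7.36)
nu <- length(n_g) - 1                                # N fixed: one degree of freedom less
c(chi2_true, pchisq(chi2_true, nu), chi2_est, 1 - pchisq(chi2_est, nu))
```

The last line returns the two $\chi^2$ values (about 5.18 and 7.42) with the corresponding cumulative value and right-tail area discussed in the answer.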


Fig. 7.6 Experimental data with error bars and values of the binomial distribution
(empty squares) for 100 trials, each consisting of 10 fair coin tosses. To guide the eye, the
discrete points of the binomial density have been joined with a line


7.6 Hypothesis Testing with Contingency Tables 


So far, we have described how to apply the $\chi^2$-test when comparing a histogram
with a parameter-dependent density model. Now let's see how to modify this
procedure when comparing two or more samples, without assuming a specific
density function for their population. These tests are called non-parametric. First
of all, we note that, if the experimental data consists of a single frequency
corresponding to a number of successes $> 10$, the use of the $\chi^2$-test is equivalent to
the use of the two-tailed Gaussian test on a standard variable. Indeed, if we consider
the variables

$$T = \frac{X - \mu}{\sigma} , \qquad Q(1) = \frac{(X - \mu)^2}{\sigma^2} ,$$

we see that the first one follows the Gaussian density and the other one the $\chi^2$
density with one degree of freedom. You can easily check this fact by arbitrarily
assigning a value of $T$ and performing both the two-tailed Gaussian test with
Table E.1 and the one-tailed test for the variable $Q(1)$ with Table E.3: identical
results are obtained.
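
The same check can be made in R instead of using the tables. In this short sketch the standard values are arbitrary choices; the comparison uses pnorm for the two-tailed Gaussian test and pchisq with one degree of freedom for the right-tailed test on $T^2$.

```r
# Check that the two-tailed Gaussian test on T equals the right-tailed chi-square(1) test on T^2.
t_val   <- c(0.5, 1.0, 1.96, 2.5, 3.0)                    # arbitrary standard values
p_gauss <- 2 * pnorm(-abs(t_val))                         # two-tailed Gaussian p-value
p_chi2  <- pchisq(t_val^2, df = 1, lower.tail = FALSE)    # right-tailed chi-square p-value
all.equal(p_gauss, p_chi2)
```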

In the case of a pair of Gaussian variables, the compatibility test can be performed
using the Student or Gaussian density, according to the difference method of
Sects. 7.2 and 7.3. Alternatively, if the experiment determines how often an event
occurs in two independent samples, the analysis of contingency tables with the
$\chi^2$-test is often used. This method requires the creation of a table containing the
number of successes $n_a$ and $n_b$ obtained with $N_a$ and $N_b$ trials, respectively:

         | Successes   | Failures                  | Total
Sample A | $n_a$       | $N_a - n_a$               | $N_a$
Sample B | $n_b$       | $N_b - n_b$               | $N_b$
Total    | $n_a + n_b$ | $N_a + N_b - n_a - n_b$   | $N_a + N_b = N$


Assuming that the two samples come from the same stochastic process with true
probability $p$, the expected contingency table is:

         | Successes        | Failures              | Total
Sample A | $pN_a$           | $(1-p)N_a$            | $N_a$
Sample B | $pN_b$           | $(1-p)N_b$            | $N_b$
Total    | $p(N_a + N_b)$   | $(1-p)(N_a + N_b)$    | $N_a + N_b = N$


Each row of the experimental contingency table is a two-bin histogram, and the
associated expectation table provides the corresponding expected value of the
number of successes and failures. From Pearson's Theorems 4.4 and 4.6 and from
the $\chi^2$ additivity Theorem 3.4, the quantity:

$$\chi^2 = \frac{(n_a - pN_a)^2}{pN_a} + \frac{[N_a - n_a - (1-p)N_a]^2}{(1-p)N_a} + \frac{(n_b - pN_b)^2}{pN_b} + \frac{[N_b - n_b - (1-p)N_b]^2}{(1-p)N_b} \tag{7.37}$$

can be considered as a $\chi^2$ variable with two degrees of freedom.

If $p$ is unknown and the null hypothesis just states that it is the same for the
samples A and B, a point estimate can be calculated from the data. If the observed
successes and failures are $> 10$, then one can set:

$$p \simeq \hat{p} = \frac{n_a + n_b}{N_a + N_b} = \frac{n_a + n_b}{N} . \tag{7.38}$$

Since this assumption introduces a further dependency relation between the data,
according to Definition 6.3, Eq. (7.37) represents the values assumed by a $\chi^2$
variable with a single degree of freedom. Notice that the method is approximate,
because the true probability has been estimated from the observed frequency (7.38).
The tendency towards the $\chi^2$ density of the variable (7.37) obviously holds for
$N_a, N_b \to \infty$. However, the method is accepted and gives good results for
$N > 40$ and $n_a, n_b > 10$. This condition assures an approximately Gaussian number
of successes and reliable estimates of the probability. This point has been previously
discussed in Sect. 6.7.


Exercise 7.8 

To prove the validity of a vaccine, two groups of guinea pigs were studied, 
one vaccinated and the other unvaccinated. The results are reported in the 
following contingency table: 


             | Sick | Healthy | Total
Vaccinated   |  10  |   41    |  51
Unvaccinated |  18  |   29    |  47
Total        |  28  |   70    |  98


Does the vaccine pass the right-tailed test at the 5% level? 


Answer If we assume as a null hypothesis $H_0$ that the vaccine is not effective,
then the differences between the two groups of guinea pigs are only due
to statistical fluctuations. Under this hypothesis, the best estimate of the
probability of contracting the disease will be given by the frequency (7.38):

$$\hat{p} = \frac{28}{98} = 0.286 \simeq 29\% .$$

Consequently, the probability to stay healthy is equal to 0.714. We can then
construct the expected contingency table (i.e. the table of expected values
under $H_0$: $0.286 \cdot 51 = 14.6$, etc.):

             | Sick | Healthy | Total
Vaccinated   | 14.6 |  36.4   |  51
Unvaccinated | 13.4 |  33.6   |  47
Total        |  28  |   70    |  98

From Eq. (7.37) one gets:

$$\chi^2 = \frac{(10 - 14.6)^2}{14.6} + \frac{(18 - 13.4)^2}{13.4} + \frac{(41 - 36.4)^2}{36.4} + \frac{(29 - 33.6)^2}{33.6} = 4.24 .$$


From Table E.3 we deduce that, with one degree of freedom, a value $\chi^2(1) = 3.84$
corresponds to a p-value = 5%, whereas the obtained value, $\chi^2(1) = 4.24$,
corresponds to $p \simeq 4\%$. We conclude that the vaccine passes the
efficacy test at the 5% level, because the probability of being wrong by discarding
a true $H_0$ hypothesis, stating that the vaccine is ineffective, is only 4%.

These calculations can be performed in R with the chisq.test routine,
which can be used straightforwardly for contingency tables. If the
table is loaded by rows with the rbind routine, with the simple command
chisq.test(rbind(c(10,41),c(18,29)),corr=F) the
values X-squared=4.187, df=1, p-value=0.0407 are obtained.
The corr=F condition excludes the Yates correction (7.27). Since a $2 \times 2$
contingency table has been used for this problem, we can alternatively use the
method of Sect. 7.2 and the pooled formula (7.9).

With the call GdiffProp(c(10,18),c(51,47),pool=T), we
obtain the values: quantile tz=-2.046, p-value for a two
tailed Z test=0.041. The non-pooled formula (7.7) gives a p-value of
0.038. With this method the expected contingency table is not used. The value
tz^2 = 4.19 corresponds to a $\chi^2$ with one degree of freedom, in agreement
with the value 4.24 previously found with the expected contingency table. We
would have obtained two identical values if in GdiffProp we had used,
instead of the experimental frequencies, those of the expected contingency
table. It should be noted that both methods are approximate: the method using
the expected contingency table implies that the true probabilities coincide
with the frequencies calculated from the data under the assumption that the
vaccine is ineffective, while the difference method uses the statistical errors
calculated from the experimental frequencies instead of the true errors.
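
As a cross-check, Eqs. (7.37) and (7.38) can also be coded by hand and compared with chisq.test. The sketch below uses illustrative variable names and the full spelling correct=FALSE of the option that excludes the Yates correction; the expected table is not rounded, which is why the result is 4.187 rather than the 4.24 obtained above with one-decimal expected values.

```r
# Sketch of Eqs. (7.37)-(7.38) on the table of Exercise 7.8 (illustrative names).
obs <- rbind(Vaccinated   = c(Sick = 10, Healthy = 41),
             Unvaccinated = c(Sick = 18, Healthy = 29))
N_row <- rowSums(obs)                              # N_a and N_b
p_hat <- colSums(obs) / sum(obs)                   # Eq. (7.38) and its complement
expct <- outer(N_row, p_hat)                       # expected contingency table
chi2  <- sum((obs - expct)^2 / expct)              # Eq. (7.37)
p_val <- pchisq(chi2, df = 1, lower.tail = FALSE)  # one degree of freedom
res   <- chisq.test(obs, correct = FALSE)          # built-in test, no Yates correction
c(chi2, p_val, res$statistic, res$p.value)
```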


We now show how the $\chi^2$ test can be extended to contingency tables of any size.
In general, we can consider the following contingency table:

          | Channel 1       | Channel 2       | ... | Channel c       | Total
Sample 1  | $n_{11}$        | $n_{12}$        | ... | $n_{1c}$        | $\sum_j n_{1j}$
Sample 2  | $n_{21}$        | $n_{22}$        | ... | $n_{2c}$        | $\sum_j n_{2j}$
...       | ...             | ...             | ... | ...             | ...
Sample r  | $n_{r1}$        | $n_{r2}$        | ... | $n_{rc}$        | $\sum_j n_{rj}$
Total     | $\sum_i n_{i1}$ | $\sum_i n_{i2}$ | ... | $\sum_i n_{ic}$ | $\sum_{ij} n_{ij} = N$

which is composed of $r$ rows (the histograms of the samples) and $c$ columns (the
histogram bins). When only one histogram is present, this test can be performed
with the method of Sect. 7.5.

As usual, we want to check whether the samples are homogeneous, that is, if
they come from the same parent population. If this null hypothesis is true, then
we can associate a true (unknown) probability $p_j$ with any column of the table.
After multiplication by the total number of elements of a sample row, $N_i = \sum_j n_{ij}$,
this probability gives, in any cell, the expected number of events $p_j N_i$. Taking into
account Pearson's Theorems 4.4 and 4.6 and the $\chi^2$ additivity Theorem 3.4, we
can say that the quantity:

$$\chi^2 = \sum_{i,j} \frac{\left(n_{ij} - p_j \sum_k n_{ik}\right)^2}{p_j \sum_k n_{ik}} = \sum_{i,j} \frac{(n_{ij} - N_i p_j)^2}{N_i p_j} \tag{7.39}$$

is $\chi^2$ distributed.


The unknown value of the probability can be estimated from the data using
Eq. (7.38) and summing by rows:

$$\hat{p}_j = \frac{\sum_{i=1}^{r} n_{ij}}{N} , \tag{7.40}$$

where $N$ is the total number of events of the table.

Now we come to the calculation of the degrees of freedom. From Pearson's
Theorem 4.6, we conclude that every row contributes with $(c - 1)$ degrees of
freedom. However, this number, which is $r(c - 1)$, must be decreased by the number
of Eqs. (7.40) that have been used for the estimation of the true probabilities. These
relations are $(c - 1)$, because the probability of the last column is fixed by the closure
relation:

$$\sum_{j=1}^{c} \hat{p}_j = 1 .$$

Therefore, the total number of degrees of freedom is:

$$\nu = r(c-1) - (c-1) = (r-1)(c-1) . \tag{7.41}$$


In conclusion, given a predetermined significance level $SL = \alpha$ (usually $\alpha = 0.01$ to
$0.05$), on one or more histograms collected in a contingency table with
($i = 1, 2, \ldots, r$) rows and ($j = 1, 2, \ldots, c$) columns, one proceeds as follows. The
reduced $\chi^2$:

$$\chi^2_R(\nu) = \frac{1}{\nu} \sum_{i,j} \frac{(n_{ij} - N_i \hat{p}_j)^2}{N_i \hat{p}_j} = \frac{1}{\nu} \left[ \sum_{i,j} \frac{n_{ij}^2}{N_i \hat{p}_j} - N \right] , \tag{7.42}$$

$$\nu = (r-1)(c-1) , \tag{7.43}$$

is calculated, where $N_i$ are the row totals, $N$ is the total number of events of the table
and $\hat{p}_j$ is given by Eq. (7.40); $H_0$ is rejected at a level $\alpha$ if from Tables E.3, E.4 or
from the routine pchisq a p-value $< \alpha$ is obtained.

Note that, in the non-parametric case, the hypothesis is not rejected when the
$\chi^2$ is too small, since the case where the a priori density model contains too many
parameters is here inapplicable. The test is therefore always one-tailed, on the right
tail. However, it should be noted that a too small value of $\chi^2$, corresponding to
cumulative values $< 0.01$, indicates a suspect coincidence between the experimental
and expected contingency tables, usually due to non-random fluctuations or to
correlations among data. In these situations, the null hypothesis should be accepted
with caution, especially if there is a large number of degrees of freedom. In this
regard, we recommend solving Problem 7.5.

Could we avoid using the $\chi^2$-test and always resort to the Gaussian difference
test as done for the ($2 \times 2$) contingency tables? The answer is negative, and we want
to explain why. In the case of contingency tables with more than two dimensions, we
could think of making a sum of frequency differences divided by the relative error,
as in Eq. (7.7). If we remove the absolute value of the difference, the sum of these
random variables will also be Gaussian, and we could perform the significance test
with Table E.1. However, a sum of large fluctuations (positive and negative) could
provide a good value of the standard variable even in the case of a definitely wrong
hypothesis. If, to avoid this inconvenience, we used the absolute value in the sum
of the differences, then we would no longer get at the end a Gaussian variable,
and therefore we would not be able to perform the test. On the other hand, the $\chi^2$
variable accumulates all the squared fluctuations, being a sum of squares of the data
with respect to the true value, and is therefore more reliable. In technical language,
we can say that the $\chi^2$ test, compared to the Gaussian difference test, minimizes the
type II error (accepting a wrong null hypothesis) and that therefore, based on
the terminology that we will introduce in Chap. 10, it is a more powerful test.


Exercise 7.9 

In a factory, the production of a rubber timing belt is checked by three 
operators X, Y and Z. They perform a visual test on the quality of the product 
and may accept the piece or discard it for a type A or B defect. The work of 
the operators is summarized in the following table: 


Type A Type B Good Total 


Operator X 10 54 26 90 

Operator Y 20 50 50 120 
Operator Z 30 35 35 100 
Total 60 139 111 310 


Determine whether the control criteria of the three operators are homogeneous 
at a 1% level. 


Answer From Eq. (7.40), the following probabilities are obtained:

$$\hat{p}_A = \frac{60}{310} = 0.194 , \quad \hat{p}_B = \frac{139}{310} = 0.448 , \quad \hat{p}_{good} = \frac{111}{310} = 0.358 ,$$

which allow us to evaluate the expected contingency table:

           | Type A | Type B | Good  | Total
Operator X |  17.5  |  40.4  |  32.2 |  90
Operator Y |  23.2  |  53.8  |  43.0 | 120
Operator Z |  19.3  |  44.8  |  35.8 | 100
Total      |  60.0  | 139.0  | 111.0 | 310


When repeating these calculations, you may find some small differences 
compared to the above values due to rounding effects. 

We can use here the R routine chisq.test, loading the data by rows
with the routine rbind through the call:

chisq.test(rbind(c(10, 54, 26), c(20, 50, 50), c(30, 35, 35))),

which gives the results: X-squared=18.877, df=4,
p-value=0.00083. Indeed, from Eq. (7.43), it results that the degrees of
freedom (df) are:

$$\nu = (r-1)(c-1) = (3-1)(3-1) = 4 .$$


Because of the very small p-value, we can safely conclude that these operators 
use different test criteria. Therefore, the homogeneity hypothesis must be 
rejected at the 1% level. 
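
The procedure of Eqs. (7.40)-(7.43) can also be written explicitly for a table of any size. The function below is a sketch with an illustrative name (chi2_table is not one of the routines of the text); applied to the table of Exercise 7.9 it should reproduce the output of chisq.test quoted above.

```r
# Sketch of Eqs. (7.40)-(7.43) for an r x c table; chi2_table is an illustrative name.
chi2_table <- function(obs) {
  N_i   <- rowSums(obs)                            # row totals N_i
  N     <- sum(obs)                                # total number of events
  p_hat <- colSums(obs) / N                        # Eq. (7.40)
  expct <- outer(N_i, p_hat)                       # expected counts N_i * p_j
  nu    <- (nrow(obs) - 1) * (ncol(obs) - 1)       # Eq. (7.43)
  chi2  <- sum(obs^2 / expct) - N                  # second form of Eq. (7.42), not yet divided by nu
  list(chi2 = chi2, nu = nu, chi2_reduced = chi2 / nu,
       p_value = pchisq(chi2, nu, lower.tail = FALSE))
}

obs <- rbind(X = c(10, 54, 26), Y = c(20, 50, 50), Z = c(30, 35, 35))
out <- chi2_table(obs)
```

The second form of Eq. (7.42) is used because it avoids forming the differences explicitly; both forms give the same value.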


7.7 Multiple Tests 


Besides the p-value, the multiplicity is the other crucial parameter of a test. For
example, we can generate a vector x of 300 Gaussian deviates with the R command x
<- c(rnorm(300)). With the $\chi^2$ test, we can easily verify the hypothesis $H_0$
that this sample originates from the standard normal curve $N(0, 1)$. In fact, given
x, you can calculate the $\chi^2$ value of Eq. (7.25) with $\mu_i = 0$, $\sigma_i = 1$, $\forall i$. Then,
with xchi <- 1-pchisq(sum(x^2),300), you will usually obtain a p-value well above
the usual test levels, supporting the validity of $H_0$. Equivalently, you can also verify that a sample of
p-values xchi generated in this way follows the uniform distribution,
according to Theorem 7.1.

Fig. 7.7 Probability $P_F$ that at least one test belongs to the rejection region as a
function of the number $m$ of tests when all the hypotheses are true

However, we can also approach this test as a multiple test of 300 Gaussian
variables (7.5). Obviously, this is not convenient here, but one often has to deal with
families of tests, rather than with a single experiment consisting of repeated trials.
Generally speaking, suppose we have a test consisting of a family of hypotheses
$H_1, H_2, \ldots, H_m$ and we want to verify whether all the hypotheses of the family must be
accepted. Suppose also that the test level is $\alpha = 0.05$. In our previous example, we
have 300 p-values, and, although the simulated hypotheses are obviously true, out
of 300 hypotheses we will have about 15 p-values $p < \alpha$.
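
The two ways of looking at the same 300 deviates can be sketched as follows. The seed is arbitrary, and the choice of two-tailed p-values for the single deviates is an assumption of this sketch, since the text does not specify how the 300 p-values are computed.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- rnorm(300)                  # 300 standard Gaussian deviates

# Global chi-square test of H0 (Eq. 7.25 with mu = 0, sigma = 1)
xchi <- 1 - pchisq(sum(x^2), 300)

# One p-value per deviate (two-tailed: an assumption of this sketch)
pv <- 2 * (1 - pnorm(abs(x)))
n_small <- sum(pv < 0.05)        # fluctuates around 300 * 0.05 = 15

print(xchi)
print(n_small)
```

Repeating the last lines with different seeds shows the number of p-values below 0.05 fluctuating around 15 even though every hypothesis is true.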

In general, given $m$ hypotheses tested independently, the probability for all the
test results to be inside the acceptance region is $(1-\alpha)^m$, and therefore that of
having at least one element in the rejection region is $P_F = 1 - (1-\alpha)^m$. In Fig. 7.7
the behaviour of $P_F$ is shown as a function of the number $m$ of performed tests, when
all the hypotheses of the family are true and $\alpha = 0.05$. As the plot clearly shows, the
probability of rejecting at least one true hypothesis, and therefore of having false
positives, increases very rapidly with $m$. To solve this problem, we must observe that
we are now mainly interested in the level $\alpha_F$ of the family test, which must be
distinguished from the level $\alpha$ of the single test. We then define the two relations,
one the inverse of the other:


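\[ \alpha_F = 1 - (1 - \alpha)^m \simeq m\alpha , \qquad (7.44) \]
\[ \alpha = 1 - (1 - \alpha_F)^{1/m} \simeq \frac{\alpha_F}{m} . \qquad (7.45) \]

The exact formula is known as the Sidak correction, whereas the approximation

\[ \alpha_F \simeq m\alpha , \qquad (7.46) \]

which is easily obtained with a Taylor expansion of the term $(1-\alpha)^m \simeq 1 - m\alpha$
around $\alpha \simeq 0$, is named the Bonferroni correction.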
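
A few lines of R make the size of the effect and the closeness of the two corrections concrete. This is a minimal sketch; the grid of m values and the object names are ours.

```r
alpha <- 0.05
m <- c(1, 5, 10, 20, 50, 100)

# Probability of at least one rejection among m true, independent tests
P_F <- 1 - (1 - alpha)^m

# Single-test levels giving a family level alpha_F = 0.05
alpha_F <- 0.05
alpha_sidak <- 1 - (1 - alpha_F)^(1 / m)   # exact inversion, Eq. (7.45)
alpha_bonf  <- alpha_F / m                 # first-order approximation

print(data.frame(m, P_F, alpha_sidak, alpha_bonf))
```

The Sidak level is always slightly larger than the Bonferroni one, and the two become practically indistinguishable as m grows.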




Therefore, given a null hypothesis $H_0$ consisting of one family of $m$ hypotheses
$H_i$ associated with $m$ p-values $p_i$, one of these two equivalent procedures must be
chosen to perform a test at the $\alpha_F$ level:


(1) Reject $H_0$ if, for at least one hypothesis:

\[ p_i < \alpha_F/m , \quad \text{or} \quad p_i < 1 - (1-\alpha_F)^{1/m} . \qquad (7.47) \]

(2) Transform all $p_i$ according to the rule:

\[ p_i \to p_i' = m\,p_i , \quad \text{or} \quad p_i \to p_i' = 1 - (1-p_i)^m , \qquad (7.48) \]

and reject the hypothesis if there is at least one $p_i' < \alpha_F$.


Normally, the Bonferroni correction is used for simplicity, but, for large $p_i$ values, it
is preferable to apply the second of Eqs. (7.48), i.e. the non-approximate formula. For
example, if $p_i = 0.1$ and $m = 5$, Bonferroni's correction gives $p_i' = 0.50$, whereas the
exact formula gives $p_i' = 1 - 0.9^5 \simeq 0.41$. However, since the family test is usually
done for $\alpha_F = 0.01$ or $\alpha_F = 0.05$, this discrepancy is often not crucial.
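
The two transformations of Eqs. (7.48) can be compared directly in R. In this sketch the vector pv contains illustrative p-values chosen by us (the first one reproduces the numerical example above); the Bonferroni values come from the base routine p.adjust, while the exact formula is written out explicitly.

```r
pv <- c(0.1, 0.004, 0.03, 0.2, 0.6)   # illustrative p-values, m = 5
m  <- length(pv)

p_bonf  <- p.adjust(pv, method = "bonferroni")  # min(1, m * p)
p_sidak <- 1 - (1 - pv)^m                       # second of Eqs. (7.48)

print(rbind(pv, p_bonf, p_sidak))

alpha_F <- 0.05
reject_family <- any(p_sidak < alpha_F)   # TRUE if at least one p' < alpha_F
print(reject_family)
```

Note that p.adjust truncates the Bonferroni values at 1, so that they can still be read as probabilities.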

The R routine p.adjust(pv,method) uses the procedure (2), where pv
is the vector of the $p_i$ and method selects the test type. With the call pout <-
p.adjust(pv,method="bonferroni"), the routine applies the first of
Eqs. (7.48) and gives the vector pout containing the $p_i'$ values as output. This
allows us to identify the family hypotheses that do not satisfy the test level. Our
routine MultiTest(pv,method,alpha,print), with method="sidak",
also applies the Sidak correction using the second of Eqs. (7.48); this routine
also checks how many hypotheses do not satisfy the predefined alpha level and
allows you to check the output with the print parameter. Further details on the
use of these routines are given in the problems at the end of the chapter.

The Sidak-Bonferroni (SB) correction completely solves the problem of multiple
tests when all the hypotheses of the family are true. However, the goal of tests is
usually to identify which of the family's hypotheses are false. Consider, for example, a
drug or a group of drugs given to different groups compared to a control group
that was given a placebo. In these cases, it is assumed as a null hypothesis that all
groups of the family are equivalent, and the hypotheses $H_i$ which do not satisfy
this criterion correspond to the groups to which the effective drug has been given.
In this situation, the test is required to identify false hypotheses with the
maximum efficiency, since they are connected to the searched effect.

This crucial feature is called power of the test. The real situation, summarized
in Table 7.3, is then more complex than that considered so far. In the formalism of
the table, the power is defined as the mean value of the fraction $V/m_1$ of the false
hypotheses correctly identified.




Table 7.3 Possible results of a multiple test with $m$ hypotheses $H_i$, of which $m_0$ true. False
positives (FP) are named type I errors; false negatives (FN) are type II errors


               Accept H_i           Reject H_i           Total
H_i true       U  true negative     F  false positive    m_0
H_i false      W  false negative    V  true positive     m_1 = m - m_0
Total          m - R                R                    m
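
In a simulation, where the truth of each hypothesis is known, the four entries of Table 7.3 can be counted directly. The helper multi_counts below is hypothetical (it is not one of the routines of the book), and the two logical vectors are toy data invented for the illustration.

```r
# Hypothetical helper: 'is_true' flags the true hypotheses,
# 'rejected' flags the hypotheses rejected by the test
multi_counts <- function(is_true, rejected) {
  c(U = sum(is_true & !rejected),    # true negatives
    F = sum(is_true & rejected),     # false positives (type I errors)
    W = sum(!is_true & !rejected),   # false negatives (type II errors)
    V = sum(!is_true & rejected))    # true positives
}

is_true  <- c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)    # toy data
rejected <- c(FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)
cnt <- multi_counts(is_true, rejected)
print(cnt)

R <- cnt[["F"]] + cnt[["V"]]                 # total number of rejections
ratio_FR <- if (R > 0) cnt[["F"]] / R else 0 # F/R, set to 0 when R = 0
frac_V <- cnt[["V"]] / sum(!is_true)         # V/m1 for this single test
print(c(R = R, F_over_R = ratio_FR, V_over_m1 = frac_V))
```

Averaging frac_V over many simulated experiments gives an estimate of the power as defined in the text.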


When analysed in terms of power, the SB correction is completely unsatisfactory:
in fact, the values $p'$ of Eq. (7.48) increase linearly with the number of hypotheses
present in the test, so that the p-values of many false hypotheses quickly rise above
the test level $\alpha_F$ and these hypotheses are not correctly identified. In statistical
jargon, the SB correction is conservative and not very powerful (i.e. new effects are
easily missed), or it has a low detection potential. For this reason, the technique of
multiple tests has been refined in recent years with many other methods, developed
with the aim of identifying the false hypotheses $H_i$ present within the $H_0$ family. For
example, in modern genomics, while sequencing complex genomes, multiple tests are
used against a null hypothesis involving hundreds or thousands of p-values. Almost all
of these tests are included in the R software. We will here describe the
Benjamini-Hochberg [BH95] test, called the BH test, which is one of the most
frequently used. To start, we have to introduce two important terms of the multiple
test language that describe the probability of getting false positives: the family-wise
error rate (FWER) and the false discovery rate (FDR).

FWER indicates the probability of making at least one type I error, that is, the
probability that at least one true hypothesis of the family does not pass the test level,
i.e. $P\{F \ge 1\}$, with the notation of Table 7.3. For example, when we have ten
hypotheses, $\alpha_F = 0.05$ and Eq. (7.45) is used, FWER gives the probability that at
least one test on $H_i$ has $p_i < \alpha_F/m = 0.005$. This fact, if all the hypotheses are
true, should occur on average in 5 per thousand of the total tested hypotheses, and
since the testing procedure checks ten hypotheses, this happens on average in a
fraction of the family tests exactly equal to $\alpha_F$. Therefore, if all the hypotheses are
true:


\[ \mathrm{FWER} = P\{F \ge 1\} = 1 - (1-\alpha)^m = \alpha_F , \quad m_0 = m , \qquad (7.49) \]


where the number $F$ of true hypotheses that are rejected is defined in Table 7.3.
When a property holds under the condition $m_0 = m$ which, according to Table 7.3,
implies that all the hypotheses are true, it is said to be valid in the weak sense. When
instead a property holds for $m_0 < m$, it is said to be valid in the strong sense.

For example, assuming the existence of false hypotheses within the family, and
denoting with $i_j$, $j = 1, 2, \ldots, m_0$ the indices of the tests corresponding to the true
hypotheses, it can be shown that the following property holds in the strong sense for
the Bonferroni correction:


\[ \mathrm{FWER} = P\Big\{ \bigcup_{j=1}^{m_0} (p_{i_j} \le \alpha_F/m) \Big\} \le \sum_{j=1}^{m_0} P\{p_{i_j} \le \alpha_F/m\} = \frac{m_0}{m}\,\alpha_F . \qquad (7.50) \]


Equation (7.50) is also valid in the general case of mutually dependent hypotheses,
because the set union takes into account the possibility of rejecting two or more
hypotheses at the same time when their p-values are mutually dependent;
the inequality holds on the basis of Eq. (1.17). We can therefore state that the
Bonferroni correction of Eqs. (7.44, 7.45) ensures by construction the property
FWER = $\alpha_F$ in the weak sense, and FWER $\le \alpha_F$ in the strong sense, even for
correlated hypotheses.¹

The FDR property has instead been introduced to evaluate the strong properties
of the test, when $m_0 < m$. It is defined as the expected value of the ratio between
the number $F$ of true hypotheses $H_i$ that are falsely rejected and the total number $R$
of rejected hypotheses. Hence, FDR = $\langle F/(F+V) \rangle = \langle F/R \rangle$, where, to avoid
division by zero, the ratio is set to 0 when $R = 0$ (see, as usual, Table 7.3).

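When $m_0 < m$, one obtains, in the strong sense:


\[ \mathrm{FDR} = \Big\langle \frac{F}{R} \Big\rangle \le \langle F \rangle = \sum_{j=1}^{m_0} P\{p_{i_j} \le \alpha_F/m\} = \frac{m_0}{m}\,\alpha_F \simeq \mathrm{FWER} . \qquad (7.51) \]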


When $m_0 = m$, FDR can be considered as the mean value of a binary random
variable: 0 when $F = 0$ and 1 when $F > 0$, since in this case $V = 0$ and $F/R = 1$.
We can then write:


\[ \mathrm{FDR} = P\{F = 0\} \cdot 0 + P\{F \ge 1\} \cdot 1 = P\{F \ge 1\} = \mathrm{FWER} . \qquad (7.52) \]


All previous considerations can be summarized by the following two important
properties:


\[ \mathrm{FDR} \le \mathrm{FWER} , \quad m_0 < m , \]
\[ \mathrm{FDR} = \mathrm{FWER} = \alpha_F , \quad m_0 = m . \qquad (7.53) \]


To correctly apply these formulae during the following discussion, it is useful to 
keep in mind that only m and R, among the variables defined in Table 7.3, are 
known to the experimenter. 
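
The weak-sense control of FWER can be checked with a short simulation in which all the hypotheses are true, so that the p-values are uniform. This is a sketch under our own choices of family size, number of replicas and seed.

```r
set.seed(2)          # arbitrary seed
m <- 10              # hypotheses in the family, all true
alpha_F <- 0.05
n_rep <- 20000       # number of simulated families

# Under true hypotheses the p-values are uniform in (0, 1)
pmat <- matrix(runif(n_rep * m), nrow = n_rep)
pmin_family <- apply(pmat, 1, min)   # smallest p-value of each family

# F >= 1 when the smallest p-value falls below the single-test level
fwer_bonf  <- mean(pmin_family < alpha_F / m)
fwer_sidak <- mean(pmin_family < 1 - (1 - alpha_F)^(1 / m))
fwer_naive <- mean(pmin_family < alpha_F)   # no correction at all

print(c(bonferroni = fwer_bonf, sidak = fwer_sidak, uncorrected = fwer_naive))
```

With the corrections the estimated FWER stays close to 0.05, while without them it approaches 1 - 0.95^10, about 0.40.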


¹ Here we define as correlated hypotheses the cases in which test statistics are correlated.




One of the most widespread methods used to increase the power of the test is that of
Benjamini-Hochberg, known as the BH method. In 1995 they demonstrated


Theorem 7.2 (of Benjamini-Hochberg) A family of hypotheses $H_1, H_2, \ldots, H_m$
is given, corresponding to a set of p-values $p_1, p_2, \ldots, p_m$, sorted in increasing
order. If $k$ is the largest index $i$ satisfying the inequality:


\[ p_i \le \frac{i}{m}\,\alpha \qquad (7.54) \]


and all the hypotheses $H_i$, $i = 1, 2, \ldots, k$ are rejected, a test with FDR $\le \alpha$ is
obtained, for any configuration of false and true hypotheses when they are mutually
independent.


Proof For the non-trivial proof of the theorem, based on the principle of induction,
we refer to the original article [BH95]. □


The theorem leads to a very simple method, similar to that of Bonferroni based on
the first of Eqs. (7.48): instead of multiplying all $p_i$ by $m$, just sort them in ascending
order and multiply them by $m/i$, where $i$ is the index obtained in the sorting. The
value $p_1'$ is therefore identical to that of Bonferroni, while the last one, $p_m'$, remains
unchanged. The remarkable fact, stated by the theorem, is that, if we reject all
the hypotheses up to the largest index for which $p_i' \le \alpha$, we obtain an expected
value FDR $\le \alpha$ for the fraction of rejected hypotheses that are actually true.
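
The step-up rule of Theorem 7.2 can be written out in a few lines and compared with the base R routine p.adjust. The function bh_reject is a hypothetical helper of ours, and the ten p-values are invented for the illustration. Note that p.adjust with method="BH" also takes a running minimum of the values $m p_i / i$ starting from the largest index; this makes the adjusted values monotone and leads to the same decisions as the rule of the theorem.

```r
# Step-up rule of Theorem 7.2, written out explicitly (hypothetical helper)
bh_reject <- function(pv, alpha = 0.05) {
  m   <- length(pv)
  ord <- order(pv)                                  # indices sorting pv
  ok  <- which(pv[ord] <= seq_len(m) / m * alpha)   # Eq. (7.54)
  rejected <- rep(FALSE, m)
  # reject H_1, ..., H_k, where k is the largest index satisfying Eq. (7.54)
  if (length(ok) > 0) rejected[ord[seq_len(max(ok))]] <- TRUE
  rejected
}

# Illustrative p-values (invented for this sketch)
pv <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216)
by_hand <- bh_reject(pv, alpha = 0.05)
by_padj <- p.adjust(pv, method = "BH") <= 0.05

print(rbind(by_hand, by_padj))
```

With these values only the two smallest p-values are rejected by either route.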

The BH method is implemented in the R routine p.adjust(pv,method)
with the option method="BH" or equivalently with method="fdr". Note that
p.adjust has no level argument: it returns the adjusted p-values, which are then
compared with a chosen level alpha. It should be kept in mind that in this case
alpha represents the upper bound of FDR rather than the global test
level $\alpha_F$. Therefore, with the BH method, the control of FWER is abandoned. This
parameter can then assume large values, even around 0.5. We recall that, from
Eq. (7.53), if FWER is controlled, the same happens for FDR; however, the
converse is not true, because if the power increases, the number of rejected true
hypotheses with small p-values (due to statistical fluctuations) inevitably
increases as well.


Exercise 7.10

Generate with R 900 standard normal deviates and 100 variates $S \sim N(3, 1)$
with $\mu = 3$ and test the hypothesis $H_0$ that the data originate from the standard
Gaussian $N(0, 1)$. Use the Bonferroni (SB) and Benjamini-Hochberg (BH/fdr)
methods.



Answer The requested data are generated with the R routine rnorm, and
the corresponding p-values pg are evaluated with pnorm. Then the data are
analysed with our routine MultiTest:


g <- c(rnorm(100,mean=3),rnorm(900))
pg <- 1-pnorm(g)
MultiTest(pg,method="bonferroni",alpha=0.05)


With a second call to MultiTest with the "fdr" method, all the test
results are obtained.

We note that, without the multiple test techniques, we have obtained,
with $\alpha = 0.05$, 11 data accepted among the 100 in disagreement with the
hypothesis (equal to a type II error of 11%) and 45 rejected data among the
900 correct ones (equal to a type I error of 5%, as predicted by the value
of $\alpha$). Since these are simulated data, different simulations will give slightly
different values.

With the statement 1-pchisq(sum(g^2),1000), where the first
argument is simply Eq. (7.25) with $\mu = 0$, $\sigma = 1$ and the second is the
number of degrees of freedom, we can verify $H_0$ with the $\chi^2$-test. A right
one-tailed p-value close to zero is obtained, indicating the presence of many wrong
hypotheses. With MultiTest we can then search for the false hypotheses.
We obtain Table 7.4, which clearly shows the power gain which is acquired
with the BH method. Notice that in a real experiment only R is known.
For a simulated calculation of the parameters FDR and FWER, it would be
necessary to repeat the exercise a very large number of times, also reducing
the number of elements of the family if necessary. This evaluation can be
performed with our LogpFdr routine, which you are invited to examine.


Table 7.4 Results of Exercise 7.10 obtained with the routine
MultiTest. The symbols are those of Table 7.3

Method        F/R     Power V/m_1
Bonferroni    0/22    22/100
BH-fdr        3/66    63/100
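
For readers who do not have the routine MultiTest at hand, the exercise can be sketched with the base R routine p.adjust alone. The seed, the helper count_FRV and the use of right one-tailed p-values are assumptions of this sketch; since the data are simulated, the counts will differ somewhat from those of Table 7.4.

```r
set.seed(3)                                   # arbitrary seed
g  <- c(rnorm(100, mean = 3), rnorm(900))     # first 100 entries: false hypotheses
pg <- 1 - pnorm(g)                            # right one-tailed p-values
is_false <- rep(c(TRUE, FALSE), times = c(100, 900))
alpha <- 0.05

# Hypothetical helper returning F, R and V of Table 7.3
count_FRV <- function(rejected) {
  c(F = sum(rejected & !is_false),            # true hypotheses rejected
    R = sum(rejected),                        # all rejections
    V = sum(rejected & is_false))             # false hypotheses rejected
}

rej_none <- pg < alpha                                      # no correction
rej_bonf <- p.adjust(pg, method = "bonferroni") <= alpha
rej_bh   <- p.adjust(pg, method = "BH") <= alpha

print(rbind(uncorrected = count_FRV(rej_none),
            bonferroni  = count_FRV(rej_bonf),
            BH          = count_FRV(rej_bh)))
```

Every hypothesis rejected by Bonferroni is also rejected by BH, which is why the power V/m_1 of the latter can only be equal or larger.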


7.8 Snedecor’s F-Test 


As in the case of means, tests on the compatibility between the variances of
distributions can also be performed.

For samples with fewer than a hundred events, this test cannot be performed using
the Gaussian or Student density. However, for Gaussian samples the Snedecor's
density $F$, introduced in Exercise 5.6, can be used. These tests, called F-tests, have
been generalized and used extensively in the ANOVA method, which is illustrated
in the next section.

If $s_1^2$ and $s_2^2$ are two variances obtained from two independent samples with
$N$ and $M$ events, respectively, coming from Gaussian populations having the same
variance $\sigma^2$, we know, from Eq. (6.72), that the value:


\[ F_{N-1,M-1} = \frac{s_1^2/\sigma^2}{s_2^2/\sigma^2} = \frac{s_1^2}{s_2^2} , \qquad (7.55) \]

is the ratio between two independent reduced $\chi^2$ values. From Exercise 5.6 we
then know that this ratio follows the Snedecor's density $F$ of Eq. (5.46) with
$(N-1, M-1)$ degrees of freedom. The combined use of Eq. (7.55) and of the $F$ density
quantiles of Tables E.5, E.6 is called analysis of variance or F-test.


Exercise 7.11
Two experimenters, who claim to have sampled from the same Gaussian
population, have obtained variances equal to


\[ s_1^2 = 12.5 , \quad s_2^2 = 6.4 , \]


with samples having 20 and 40 events, respectively. Check if the two sample
variances are compatible with a single true value $\sigma^2$ at the 2% level.


Answer The variable $F$ is given by:


\[ F = \frac{12.5}{6.4} = 1.95 . \]

Since the (two-tailed) test is at the level of 2%, for the initial claim
to be accepted, the experimental ratio $F$ must be smaller than the 99%
$F$ value that can be obtained from Table E.6 or with the R statement
qf(0.99,df1=19,df2=39):


\[ F_{0.99}(19, 39) \simeq 2.41 , \]


and larger than the 1% $F$ value that can be obtained with Eq. (5.49) or with
the R statement qf(0.01,df1=19,df2=39):


\[ F_{0.01}(19, 39) = \frac{1}{F_{0.99}(39, 19)} \simeq \frac{1}{2.77} = 0.36 . \]

Since:


\[ 0.36 < 1.95 < 2.41 , \]


the two values are compatible at the 2% test level.
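
The whole exercise fits in a few R statements. In this sketch the object names are ours, and the two-tailed p-value computed at the end (twice the smaller tail area) is a common convention that is not taken from the text.

```r
s1sq <- 12.5; n1 <- 20     # first sample: variance and number of events
s2sq <- 6.4;  n2 <- 40     # second sample
Fobs <- s1sq / s2sq

# Acceptance interval of the two-tailed test at the 2% level
lower <- qf(0.01, df1 = n1 - 1, df2 = n2 - 1)
upper <- qf(0.99, df1 = n1 - 1, df2 = n2 - 1)
compatible <- (Fobs > lower) && (Fobs < upper)

# Two-tailed p-value: twice the smaller tail area (a convention of this sketch)
cdf  <- pf(Fobs, n1 - 1, n2 - 1)
pval <- 2 * min(cdf, 1 - cdf)

print(c(F = Fobs, lower = lower, upper = upper, p.value = pval))
print(compatible)
```

The lower quantile can also be checked against the reciprocal relation used in the answer, 1/qf(0.99, 39, 19).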


7.9 Analysis of Variance (ANOVA) 


In Sect. 7.3 we applied the t-test to verify the hypothesis that two independent 
samples have the same mean. This is the simplest example of analysis of variance 
(ANOVA), that is, the set of procedures used to establish whether groups of elements 
behave in a similar way or not (i.e. beyond purely statistical fluctuations) under 
different conditions. 

The first step towards a test generalization is to consider more than two groups, 
i.e. more than two conditions, such as the comparison between a reference drug 
and alternative drugs for the treatment of a disease, the effect of different teaching 
methods on learning and the effect of different soldering methods on the quality 
of printed circuit boards. In the same way, more levels of the same treatment can 
also be compared, such as a chemical reaction at different temperatures or different 
dosages of the same active ingredient in a drug. 

The application examples we have just mentioned indicate that ANOVA applies 
to programmed experiments, for which there is a specific statistical terminology: 


• The response is the main parameter of interest (e.g. the maths learning level).

• The factors are other quantities that are varied during the experiment, because it
  is assumed that they can influence the response (e.g. the teaching method).

• The different values that can be taken on by the factors are called levels.

• A factor can be qualitative or quantitative, where in the first (second) case the
  levels cannot (can) be put in correspondence with values on a scale.


When experiments are planned, it is almost always recognized that there are 
different factors influencing the response and, to optimize time and material 
resources, multiple factors are examined at the same time. However, to facilitate 
our presentation, we will start with an example of a one-way ANOVA, with only 
one single factor. 




Table 7.5 Number of breaks in the warp of a fabric according to the tension of the loom

Tension       Breaks
Low (L)       27  14  29  19  29  31  41  20  44
Medium (M)    42  26  19  16  39  28  21  39  29
High (H)      20  21  24  17  13  15  15  16  28


To produce a fabric, the threads of the weft are intertwined with those of the warp. 
When weaving, warp threads are stretched parallel on the loom and can break. In 
one experiment, several fabric samples of equal length were produced by subjecting 
the warp to high, medium or low tension, and the number of warp breaks in each 
sample was counted. The results are shown in Table 7.5. The factor of interest is the 
tension and the response the number of breaks. The basic idea behind ANOVA is 
to identify the sources of variations, in order to disentangle the effects of the factor 
on the response from pure statistical effects. The way to proceed is suggested by a
simple algebraic equivalence: denoting by $y_{ij}$ the number of breaks in the fabric $j$
produced with tension $i$, by $m$ the number of tension levels and by $n$ the number of
fabric samples produced for each tension level, we have in fact:


\[ y_{ij} = \bar{y}_{..} + (\bar{y}_{i.} - \bar{y}_{..}) + (y_{ij} - \bar{y}_{i.}) , \quad i = 1, \ldots, m ; \; j = 1, \ldots, n , \qquad (7.56) \]


where $\bar{y}_{i.} = \sum_j y_{ij}/n$ and $\bar{y}_{..} = \sum_{ij} y_{ij}/(mn)$. For convenience we have
associated the tension levels L, M and H to the indices 1, 2 and 3.

Hence, the fabric $ij$ has a number of breaks given by the sum of the general
average of all the breaks, adjusted with the difference between the average breaks
of the fabrics at tension $i$ and the general average (effect of tension $i$) and with
the difference between the breaks in fabric $ij$ and the average of the fabrics at tension
$i$ (statistical fluctuation). If we generalize the specific sample of 27 fabrics into a
population model, we can consider both the general average and the tension
effect as population parameters and define the following model:


\[ Y_{ij} = \mu + \tau_i + \varepsilon_{ij} , \quad i = 1, \ldots, m ; \; j = 1, \ldots, n . \qquad (7.57) \]


Here the $\varepsilon_{ij}$ are independent random variables with zero mean, and then $Y_{ij}$ is a
random variable with mean $\mu + \tau_i$: that is, it is expected that on average the number
of breaks is determined by the loom and the specific tension set. Since
$\sum_i (\bar{y}_{i.} - \bar{y}_{..}) = 0$ by definition, it is natural to set $\sum_i \tau_i = 0$. This constraint also
ensures that we have $m$ independent parameters to describe the $m$ populations
identified by the $m$ tension levels. With the model (7.57), we can investigate whether
the tension has an effect on breaks using the hypothesis test:


\[ H_0 : \tau_1 = \cdots = \tau_m = 0 \quad \text{against} \quad H_1 : \exists\, i \text{ such that } \tau_i \ne 0 , \qquad (7.58) \]


which requires an assumption on the distribution of $\varepsilon_{ij}$, that is, $\varepsilon_{ij} \sim N(0, \sigma^2)$. We
will discuss later this assumption and that of independence among the $\varepsilon_{ij}$. From the
algebraic equivalence (7.56), the total sum of squares can be decomposed as done
before in Eqs. (2.40, 6.55):


\[ \sum_{ij} (y_{ij} - \bar{y}_{..})^2 = n \sum_i (\bar{y}_{i.} - \bar{y}_{..})^2 + \sum_{ij} (y_{ij} - \bar{y}_{i.})^2 = SS_{Tr} + SS_E . \qquad (7.59) \]


This quantity can then be partitioned into the sum of squares between groups (breaks
at different tension), often denoted as $SS_{Tr}$ (treatment sum of squares), representing
the variation due to the tension, and the sum of squares within groups (error sum
of squares), often denoted as $SS_E$, which gives the statistical measurement errors
common to all data. Under $H_0$, the variables $Y_{ij}$ are iid $N(\mu, \sigma^2)$, and hence also
the variables $\bar{Y}_{i.}$ are iid $N(\mu, \sigma^2/n)$. Therefore, due to Theorem 6.2:


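\[ MS_{Tr} = \frac{SS_{Tr}}{m-1} = \frac{n}{m-1} \sum_{i=1}^{m} (\bar{Y}_{i.} - \bar{Y}_{..})^2 \sim \sigma^2 \chi^2_R(m-1) , \qquad (7.60) \]


where MS means mean squares. Likewise, $\sum_j (Y_{ij} - \bar{Y}_{i.})^2 \sim \sigma^2 \chi^2(n-1)$, and,
taking into account that $SS_E$ is the sum of $m$ independent $\chi^2$ distributed variables
due to the independent responses, we then have:


\[ MS_E = \frac{SS_E}{m(n-1)} = \frac{1}{m(n-1)} \sum_{ij} (Y_{ij} - \bar{Y}_{i.})^2 \sim \sigma^2 \chi^2_R(m(n-1)) . \qquad (7.61) \]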

Then, under $H_0$, both $MS_{Tr}$ and $MS_E$ are unbiased estimators of $\sigma^2$, and we
expect their ratio to be not far from 1. Indeed, $\langle MS_{Tr} \rangle = \sigma^2 \langle \chi^2_R(m-1) \rangle = \sigma^2$ and
similarly $\langle MS_E \rangle = \sigma^2$. Since the scalar product (4.74)
$\sum_{ij} [(\bar{y}_{i.} - \bar{y}_{..})(y_{ij} - \bar{y}_{i.})] = 0$ because $\sum_j (y_{ij} - \bar{y}_{i.}) = n\bar{y}_{i.} - n\bar{y}_{i.} = 0$, by
Cochran's Theorem 4.5 $MS_{Tr}$ and $MS_E$ are also independent. Then, from Eq. (7.55),
we have:


\[ F = \frac{MS_{Tr}}{MS_E} \sim F(m-1, m(n-1)) , \qquad (7.62) \]


and $H_0$ will be rejected at the level $\alpha$ if $F > F_{1-\alpha}(m-1, m(n-1))$ or, equivalently,
if the p-value, which is the probability $P\{F(m-1, m(n-1)) > F\}$ that
$F(m-1, m(n-1))$ is greater than the observed value $F$, is smaller than $\alpha$. We can
easily convince ourselves that the test rejection region is the correct one by observing,
first of all, that the distribution of $MS_E$, the mean sum of squares of the fabrics at the
same tension, is always the same, both under $H_0$ and under $H_1$; indeed from the model
(7.57), we have:


\[ Y_{ij} - \bar{Y}_{i.} = \mu + \tau_i + \varepsilon_{ij} - \mu - \tau_i - \bar{\varepsilon}_{i.} = \varepsilon_{ij} - \bar{\varepsilon}_{i.} , \]


independently of $\tau_i$. Conversely, the distribution of $MS_{Tr}$ changes under $H_1$
to a non-central $\chi^2_R$ distribution² multiplied by $\sigma^2$, with expected value
$\sigma^2 + n \sum_i \tau_i^2/(m-1)$, and in this case we expect to observe large values of $F$, at odds
with the distribution under $H_0$.

Fig. 7.8 Data of Table 7.5, number of breaks in the warp compared to tension
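
The two expected values quoted above can be verified with a small simulation of the model (7.57) under $H_1$. Everything numerical in this sketch (the effects tau, the value of sigma, the seed and the number of replicas) is an illustrative choice of ours; $\mu$ is set to zero because it cancels in both mean squares.

```r
set.seed(4)                    # arbitrary seed
m <- 3; n <- 9                 # groups and replicates, as in Table 7.5
sigma <- 3                     # illustrative error standard deviation
tau <- c(-2, 0, 2)             # illustrative effects, summing to zero
n_rep <- 20000                 # number of simulated experiments

ms <- replicate(n_rep, {
  # one column per tension level; mu = 0 since it cancels in MS_Tr and MS_E
  y <- matrix(rnorm(m * n, mean = rep(tau, each = n), sd = sigma), nrow = n)
  gm <- colMeans(y)                                   # group means
  ms_tr <- n * sum((gm - mean(y))^2) / (m - 1)        # treatment mean square
  ms_e  <- sum(sweep(y, 2, gm)^2) / (m * (n - 1))     # error mean square
  c(ms_tr, ms_e)
})

expected_ms_tr <- sigma^2 + n * sum(tau^2) / (m - 1)  # formula in the text
print(c(simulated = mean(ms[1, ]), expected = expected_ms_tr))
print(c(simulated = mean(ms[2, ]), expected = sigma^2))
```

The average of $MS_E$ stays at $\sigma^2$ whatever the effects are, while that of $MS_{Tr}$ is shifted upwards, which is why large values of $F$ signal a violation of $H_0$.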

Let us focus now on the data from the weaving experiment. Before performing
any hypothesis tests, it is worth examining a graphical representation of the
data. We want to analyse the relation between tension and breaks; therefore, the
representation of the number of breaks with respect to the tension in Fig. 7.8 is
adequate. The figure shows a possible difference in mean between the tensions and
a dispersion of the measurements within the groups around their own mean that
appears not completely homogeneous.

To execute the test with R, an object of the type data frame should be
created and filled with the data of Table 7.5 using the command dat <-
data.frame(breaks,tension), where breaks is the vector obtained
by concatenating the rows of the table and tension <- rep(c("L","M",
"H"),each=9) is obtained by replicating nine times each of the levels L, M, H.
Then ANOVA is performed with the R routine aov, to which the data frame dat is
passed, specifying that the first column contains the breaks and the second one the
factor tension. The result can be stored in an object fit with the command
fit <- aov(breaks ~ tension, data=dat). The ANOVA table can then be
completed with the function summary, which analyses the output of aov.

The command summary(fit) produces Table 7.6. The p-value below five per
cent, which can also be obtained with the command 1-pf(4.059,2,24), suggests
rejecting the hypothesis $H_0$ of the absence of tension effects on warp breaks. In
Table 7.7, the equations used by aov in Table 7.6 for the calculation of the sums of
squares, of the mean squares and of the test function have been collected. The term
"residuals" in the table is common in the linear regression models that are discussed
in Chap. 11. Indeed, the $SS_E$ value can also be obtained with a least squares estimate
of the unknown parameters $\mu$ and $\tau_i$ of Eq. (7.57) by setting
$SS_E = \sum_{ij} (y_{ij} - \hat{\mu} - \hat{\tau}_i)^2$, since $\hat{\mu} + \hat{\tau}_i = \bar{y}_{i.}$.


Table 7.6 ANOVA table: R output with the data of Table 7.5, testing of the tension effect on the
number of breaks

             df    SS        MS       F       Pr(>F)
Tension       2    568.5    284.26   4.059   0.0303
Residuals    24   1680.7     70.03


Table 7.7 General ANOVA table for the one-way analysis of variance

Source of variation | df     | Sum of squares                                      | Mean squares              | F                | p-value
Treatment           | m-1    | $SS_{Tr} = n \sum_i (\bar{y}_{i.} - \bar{y}_{..})^2$ | $MS_{Tr} = SS_{Tr}/(m-1)$ | $MS_{Tr}/MS_E$   | P(> F)
Residuals           | m(n-1) | $SS_E = \sum_{ij} (y_{ij} - \bar{y}_{i.})^2$         | $MS_E = SS_E/(m(n-1))$    |                  |
Total               | mn-1   | $SS_T = \sum_{ij} (y_{ij} - \bar{y}_{..})^2$         |                           |                  |


² The non-central $\chi^2$ distribution designates the variable $Q = \sum_i X_i^2$ where $X_i \sim N(\mu_i, 1)$ and
$\lambda = \sum_i \mu_i^2 \ne 0$. The central (standard) $\chi^2$ distribution has $\lambda = 0$.
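
The steps just described can be put together in a short script, which also recomputes the entries of Table 7.6 from the formulas of Table 7.7. This is a sketch: unlike the text, it declares tension explicitly as a factor with the levels in the order L, M, H (a choice of ours that only affects the ordering of the output), and the names tab, gm, ss_tr, ss_e, F_obs and p_val are ours.

```r
breaks <- c(27, 14, 29, 19, 29, 31, 41, 20, 44,    # low tension (L)
            42, 26, 19, 16, 39, 28, 21, 39, 29,    # medium tension (M)
            20, 21, 24, 17, 13, 15, 15, 16, 28)    # high tension (H)
tension <- factor(rep(c("L", "M", "H"), each = 9), levels = c("L", "M", "H"))
dat <- data.frame(breaks, tension)

fit <- aov(breaks ~ tension, data = dat)
tab <- summary(fit)[[1]]          # the ANOVA table as a data frame
print(tab)

# The same quantities from Eqs. (7.59)-(7.62)
m <- nlevels(dat$tension); n <- 9
gm <- tapply(dat$breaks, dat$tension, mean)               # group means
ss_tr <- n * sum((gm - mean(dat$breaks))^2)               # between groups
ss_e  <- sum((dat$breaks - gm[as.character(dat$tension)])^2)  # within groups
F_obs <- (ss_tr / (m - 1)) / (ss_e / (m * (n - 1)))
p_val <- 1 - pf(F_obs, m - 1, m * (n - 1))
print(c(SS_Tr = ss_tr, SS_E = ss_e, F = F_obs, p = p_val))
```

The hand calculation and the output of aov agree to rounding, which is a useful check that the data have been entered in the right order.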

So far so good, but, to outline the main aspects of the procedure, we have
postponed the verification of some important assumptions. Taking for granted the
validity of the additivity of the effects in Eq. (7.57), we need to perform the following
checks on the collected data:


(1) Are the errors $\varepsilon_{ij}$ really random?

(2) Is the variance of the number of breaks really the same for each tension? In
other words, are we sure that $\sigma^2$ does not depend on the group?

(3) What happens to the F-test distribution if the Gaussian assumption is
violated?


Errors are a random sample: this assumption is assured by the randomization, 
that is, by the random allocation of the experimental material to each test and 
by the random order of the performed tests. This is to prevent factors beyond 
the investigator’s control from systematically influencing the results. In a real-life 
ANOVA, it is then crucial to know how the experiment was planned and executed. 




Constant variance: this assumption is fundamental to ensure that the pooled
estimate of $\sigma^2$ given by $MS_E$ is valid, independently of Eq. (7.61), which is related
to the probability distribution of $MS_E$. In Fig. 7.8 the high tension data seem less
dispersed than those of the other two groups, but some tests need to be performed for
an appropriate verification.

For instance, Levene's test [Lev60] eliminates the effect of the averages in the
deviations by transforming the vector of the responses in each group into a vector of
absolute values of deviations with respect to the mean or the median. The absolute
value is necessary to prevent deviations of opposite sign from cancelling out.
Simulation studies have shown that deviations from the medians generally lead to
ratios between the mean squares (MS) that are roughly distributed according to $F$.

The three medians of the tensions L, M and H can be calculated, and then a new
vector breaks1 can be created to be read by aov:


breaks1 <- numeric(27)  # allocate the vector of absolute deviations
breaks1[1:9] <- abs(breaks[1:9]-median(breaks[1:9]))
breaks1[10:18] <- abs(breaks[10:18]-median(breaks[10:18]))
breaks1[19:27] <- abs(breaks[19:27]-median(breaks[19:27]))
summary(aov(breaks1~tension))


A p-value of 0.25 is obtained, in good agreement with the hypothesis of
variance equality.

In R, the available tests are Bartlett's test (based on the likelihood ratio,
see Chap. 10) and Levene's test. For the reasons given above, the latter is to be
preferred when there are doubts about the Gaussianity assumption. The function
bartlett.test(breaks ~ tension, data=dat) gives a p-value of 0.15;
therefore, even with this test, we do not reject the hypothesis of constant variance.
The function which performs Levene's test is provided by the package car, which
must be installed and loaded, as it is not included in the basic R distribution: the call
leveneTest(breaks ~ tension, data=dat) gives a p-value of 0.25, in
agreement with that obtained from aov by rearranging the data.
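
Both checks can be reproduced with base R only, without installing car. In this sketch the group medians are subtracted in one step with the routine ave, which is our shortcut for the three explicit assignments shown above; the column name absdev and the other object names are ours.

```r
breaks <- c(27, 14, 29, 19, 29, 31, 41, 20, 44,    # low tension (L)
            42, 26, 19, 16, 39, 28, 21, 39, 29,    # medium tension (M)
            20, 21, 24, 17, 13, 15, 15, 16, 28)    # high tension (H)
tension <- factor(rep(c("L", "M", "H"), each = 9), levels = c("L", "M", "H"))
dat <- data.frame(breaks, tension)

# Absolute deviations from the group medians, computed group by group
dat$absdev <- abs(dat$breaks - ave(dat$breaks, dat$tension, FUN = median))

# Levene's test as a one-way ANOVA on the absolute deviations
lev <- summary(aov(absdev ~ tension, data = dat))[[1]]
p_levene <- lev[["Pr(>F)"]][1]

# Bartlett's test from base R (package stats)
p_bartlett <- bartlett.test(breaks ~ tension, data = dat)$p.value

print(c(levene = p_levene, bartlett = p_bartlett))
```

Neither p-value is small, in agreement with the conclusion that the hypothesis of constant variance is not rejected.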

Gaussian errors: since the errors $\varepsilon_{ij}$ are not directly observed, we have to use an
estimate of them, such as the residuals $y_{ij} - \bar{y}_{i.}$, to verify their properties. The
residuals of different groups are uncorrelated with each other, but this is not true
for those in the same group since, as you can easily verify,
$\langle (Y_{ij} - \bar{Y}_{i.})(Y_{ik} - \bar{Y}_{i.}) \rangle = -\sigma^2/n$ for $j \ne k$. Therefore, the residuals are not a
completely random sample; however, it is common practice to standardize them with
the function rstandard(fit), where fit is the output of aov, and, after sorting,
to represent them in a Q-Q plot, as done in Fig. 2.7, versus the expected values of an
ordered random sample from a standard Gaussian. The plot thus obtained, shown in
Fig. 7.9, is called Gaussian Q-Q plot and can be drawn with the commands
qqnorm(rstandard(fit)) and abline(0,1), where the second command
plots the bisector. The points are arranged approximately around the straight line,
so we can consider the errors as Gaussian. Even in the case of slight misalignments,
we can still use the F test, because different studies have shown that it tolerates
violations of normality rather well.
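
The standardization performed by rstandard can be made explicit for this balanced one-way layout, where each observation has leverage 1/n. The following sketch recomputes the standardized residuals by hand and extracts the coordinates of the Q-Q plot without drawing it; the object names are ours.

```r
breaks <- c(27, 14, 29, 19, 29, 31, 41, 20, 44,    # low tension (L)
            42, 26, 19, 16, 39, 28, 21, 39, 29,    # medium tension (M)
            20, 21, 24, 17, 13, 15, 15, 16, 28)    # high tension (H)
tension <- factor(rep(c("L", "M", "H"), each = 9), levels = c("L", "M", "H"))
fit <- aov(breaks ~ tension)

res  <- residuals(fit)                 # y_ij - ybar_i.
rstd <- rstandard(fit)                 # standardized residuals

# In a balanced one-way layout each leverage is 1/n, so the standardization
# divides the residuals by sqrt(MS_E * (1 - 1/n))
n <- 9
ms_e <- sum(res^2) / df.residual(fit)
by_hand <- res / sqrt(ms_e * (1 - 1 / n))

# Coordinates of the Gaussian Q-Q plot; drop plot.it = FALSE to draw it,
# then add the bisector with abline(0, 1)
qq <- qqnorm(rstd, plot.it = FALSE)
print(head(cbind(theoretical = qq$x, sample = qq$y)))
```

The factor 1 - 1/n is the same one that appears in the variance of a residual, $\sigma^2 (1 - 1/n)$, consistently with the negative covariance $-\sigma^2/n$ between residuals of the same group.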


Fig. 7.9 Gaussian Q-Q plot of the ANOVA standardized residuals from the data of the
textile experiment of Table 7.5 (sample quantiles versus theoretical quantiles)


All the assumptions made are therefore justified, and we can state that the tension
has an effect on the number of breaks. Which tension levels are responsible for the
test result? The answer to this question is of considerable practical importance, as it
indicates the most suitable level of tension to minimize breakage.

From the plot of Fig. 7.8, the high tension group has the smallest mean $\bar{y}_{i.}$, but
as usual we have to perform a statistical test to verify whether this feature is
systematic or random. If $\mu_i = \mu + \tau_i$, the comparison between all pairs of expected
values, called Tukey test [Tuk49], addresses the question:


\[ \begin{array}{l} H_0 : \mu_i = \mu_j \\ H_1 : \mu_i \ne \mu_j \end{array} \quad \text{for all } i \ne j . \qquad (7.63) \]


We have just seen in Sect. 7.7 that, in multiple tests, it would be wrong to perform
$m(m-1)/2$ individual tests at the level $\alpha$ assuming that the probability of rejecting
at least one hypothesis under $H_0$ is $\alpha$. Tukey's criterion for controlling the $\alpha$ level of
the family is to choose a single critical value $c$ for all tests in the family in order to
obtain a given $\alpha_F$ level. It is based on the following equivalence of events:


\[ \bigcup_{i \ne k} \{ |\bar{Y}_{i.} - \bar{Y}_{k.}| > c \} \iff \{ \bar{Y}_{max} - \bar{Y}_{min} > c \} , \qquad (7.64) \]


where $\bar{Y}_{max}$ and $\bar{Y}_{min}$ are the maximum and minimum values of the within group
sample means. Equation (7.64) means that, as is obvious, there is, in absolute value,
at least one difference between means exceeding $c$ if and only if the difference
between the maximum mean and the minimum mean exceeds $c$. Since
$\mathrm{Var}[\bar{Y}_{i.}] = \sigma^2/n$, with the correct estimate given by $MS_E/n$, it is natural to consider
the standardized differences divided by $\sqrt{MS_E/n}$ and reformulate Tukey's criterion
as follows:


lVi. — Veal Ymax = Ymin 

——— >ce = 9 SC} .. 7.65 
U ase VMSzE/n ( ) 
Therefore, by appropriately choosing c, we are able to identify as different all the 
pairs of means that exceed the threshold with a significance level a. The choice of 
c is based on the p.d.f. under Ho of the statistic: 


$$q = \frac{\bar{Y}_{\max} - \bar{Y}_{\min}}{\sqrt{MS_E/n}} \tag{7.66}$$

which is the so-called studentized range statistic. Its distribution is that of the
difference between the maximum and the minimum of $m$ independent Gaussian variables
having the same mean, i.e. the $\bar{Y}_{i\cdot}$'s, divided by an estimate of their standard deviation
obtained from the pooled variance estimate.

The percentiles of the studentized range are tabulated and indicated with
$q_\alpha(m, df)$, where $m$ is the number of groups and the degrees of freedom are those
of $MS_E$, so $df = m(n-1)$. In R, they are calculated by the qtukey routine and
can be easily evaluated using simulation codes as well, as in Problem 8.19. Then
we can set $c = q_{\alpha_F}(m, df)$ and claim as different all the pairs of means $(\mu_i, \mu_k)$ for
which:

$$|\bar{y}_{i\cdot} - \bar{y}_{k\cdot}| > q_{\alpha_F}(m, df)\sqrt{\frac{MS_E}{n}}, \tag{7.67}$$


with $df = m(n-1)$. From this inequality one immediately obtains the confidence
interval for the expected difference between two means:

$$(\mu_i - \mu_j) \in (\bar{y}_{i\cdot} - \bar{y}_{j\cdot}) \pm q_{\alpha_F}(m, df)\sqrt{\frac{MS_E}{n}}, \qquad \forall\, i \neq j. \tag{7.68}$$

The function TukeyHSD(fit) performs Tukey's test and computes the confidence
intervals and the p-values for each difference of means, by default at a family-wise
confidence level of 0.95, i.e. $\alpha_F = 0.05$. The result, given in Table 7.8, shows that the
mean number of breaks with a high tension level is lower than those with a medium or
low tension level (the L-H comparison only marginally, with p = 0.062) and
that there is no significant difference between the latter two. When significant, the
sign of the difference indicates the most probable ordering of the means.
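As a cross-check of Eqs. (7.67) and (7.68), the half-width of the Tukey intervals can be recomputed by hand with qtukey and compared with the output of TukeyHSD. This is only a sketch: it assumes that the data of Table 7.5 are the type B wool rows of the R dataset warpbreaks and that the fitted model is breaks ~ tension; the variable names (wb, qcrit, halfwidth, tk) are ours.

```r
# One-way ANOVA on the type B wool data (assumed to be the data of Table 7.5)
wb  <- subset(warpbreaks, wool == "B")
fit <- aov(breaks ~ tension, data = wb)

m   <- nlevels(wb$tension)        # number of groups
n   <- nrow(wb) / m               # replicates per group (balanced plan)
df  <- df.residual(fit)           # degrees of freedom of MS_E, m(n-1)
MSE <- deviance(fit) / df         # pooled variance estimate MS_E

# upper 5% point of the studentized range: q_{alpha_F}(m, df) with alpha_F = 0.05
qcrit     <- qtukey(0.95, nmeans = m, df = df)
halfwidth <- qcrit * sqrt(MSE / n)   # threshold of Eq. (7.67)

tk <- TukeyHSD(fit)$tension          # columns: diff, lwr, upr, p adj
print(halfwidth)
print(tk)
```

In a balanced plan, every interval returned by TukeyHSD should have the same half-width, equal to the threshold computed by hand.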

Table 7.8 R output of Tukey's test for the differences between the number of expected breaks for
pairs of tension levels. The data are from Table 7.5

         diff      lwr      upr    p adj
L-H     9.444   -0.419   19.296    0.062
M-H    10.000    0.149   19.851    0.046
M-L     0.556   -9.296   10.407    0.989

All the procedures so far described assume that the experimental plan is balanced,
i.e. the number of trials for each factor level is constant. This generally ensures
greater robustness in the case of assumption violations (such as that of constant
variance) and greater test power. The formulae to use in unbalanced plans are, for
instance, given in [Mon03], from which we have taken much of this section.

Before finishing this example, it is worthwhile to briefly return to the issue of
constant variance. It is true that Bartlett's and Levene's tests do not reject this
hypothesis, but the p-value of Bartlett's test is not particularly high and the
data graph in Fig. 7.8 indicates a lower variability of the response when the average
number of breaks is lower. If the breaks in the fabric sample are randomly distributed
and with a smaller and smaller probability as the examined area decreases, we are
in the presence of a Poisson process in space, similar to that in time described in
Sect. 3.7. This assumption is perfectly reasonable if the loom has been overhauled
and tuned before the experiments. The variance of a Poisson variable with mean $\mu$ is
just $\mu$, and this could explain the lower dispersion at the high tension level. Now we
wonder: do we have to redesign an inference method for the Poisson variable from
scratch, or can we continue with the one for the Gaussian variable? Using the correct
distribution is in principle better, but often it is enough to exploit a standard method
and get the needed answers without perfectly describing the process generating the
data. In this case, we look for a data transformation that stabilizes the variance,
such as $y^\lambda$, a solution often adopted. To choose $\lambda$, we observe that, using Poisson's
hypothesis and by Eq. (5.58):

$$\mathrm{Var}[Y^\lambda] \simeq \left( \frac{d y^\lambda}{d y}\Big|_{y=\mu} \right)^2 \mathrm{Var}[Y] = \left(\lambda \mu^{\lambda-1}\right)^2 \mu = \lambda^2 \mu^{2\lambda-1}, \tag{7.69}$$

so that $\lambda = 1/2$ results in a constant group variance. You can redo the data
plot, the ANOVA table, the constant variance tests and the residual analysis with
the transformed data, and you will find that the F-test has a slightly smaller p-value.
The tests on the constant variance have significantly higher p-values, and
the comparisons between means with Tukey's test are slightly more significant,
confirming the conclusions already reached.
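The effect of the square root transformation predicted by Eq. (7.69) can be illustrated with a small simulation. The sketch below is not part of the textile analysis: the Poisson means, the sample size and the seed are arbitrary choices of ours. For each mean, the variance of the raw counts should be close to $\mu$, while the variance of their square roots should be close to $\lambda^2 = 1/4$ whatever the mean.

```r
set.seed(314)                      # arbitrary seed, for reproducibility only
mu <- c(15, 25, 35)                # illustrative Poisson means (our assumption)
N  <- 1e5                          # simulated counts for each mean

sim <- sapply(mu, function(m) {
  y <- rpois(N, lambda = m)        # Poisson counts with mean m
  c(var_y = var(y), var_sqrt_y = var(sqrt(y)))
})
colnames(sim) <- paste0("mu=", mu)
print(round(sim, 3))
```

The approximation of Eq. (7.69) is a first-order one, so small residual deviations from 1/4 are expected, especially at low means.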

From this digression we have learned that: 


• If we can transform the data to bring us back to a standard procedure, we save
time and better focus on the fundamental aspects of the problem.

• The chosen transformation depends on the assumptions about the procedure used
to generate the data. If we are unable to make reasonable assumptions and we
find the correct transformation by trial and error, we will have a perfect analysis,
but we will then have to explain if and how the same conclusions hold for
untransformed data.

• When the assumption violation is slight, the fundamental conclusions continue
to hold even without applying transformations.


Before dealing with the two-way ANOVA, let us now briefly address the question
of choosing the sample size on the basis of the power of the F-test for the one-way
ANOVA (we have already introduced the power in Sect. 7.7).

In practice, we would like to know how many data we need to collect to reject
$H_0$, i.e. what is the optimum number of trials $n$ for each level of the factor, assuming
the effect has a given value $\tau_i$. The answer comes from the calculation of the power
of the test: $n$ must be the smallest value that, under $H_1$, allows us to reject $H_0$ with
the desired power, for example, 95%, when a test is performed at the assigned level
$\alpha$. The use of this criterion leads to an extremely important consequence, because if
we perform too few tests we could accept $H_0$ even if the effect does exist, missing
the discovery.

Therefore, if we require a power $\beta > 95\%$, we must calculate:

$$\beta(\tau_1,\dots,\tau_m) = P\left\{ \frac{MS_{Tr}}{MS_E} > F_{1-\alpha}(m-1, mn-m);\ \tau_1,\dots,\tau_m \right\}, \tag{7.70}$$

and choose the smallest $n$ value such that $\beta(\tau_1,\dots,\tau_m) > 0.95$. The right-hand
side of Eq. (7.70) shows that the probability of rejecting $H_0$ is evaluated from the
assigned $\tau_i$ values.

The computation of $\beta(\tau_1,\dots,\tau_m)$ depends not only on $\tau_i$, $m$ and $n$ but also on
$\sigma$, as can be derived from the considerations on the non-central $\chi^2$ following
Eq. (7.62). Under $H_1$, the distribution of $MS_{Tr}/MS_E$ is in fact a non-central F
density³ with expected value:

$$\frac{(mn-m)\left(m-1+n\sum_i \tau_i^2/\sigma^2\right)}{(m-1)(mn-m-2)}. \tag{7.71}$$

This value turns into the expected value (5.47) of F when all the $\tau_i$ values are zero.
From this equation, one deduces that the F distribution depends on the quantities $\tau_i$
through $\sum_i \tau_i^2$ or, equivalently, through $\sum_i (\mu_i - \mu)^2/(m-1) = \sum_i \tau_i^2/(m-1)$.
This last quantity is a sort of "variance" of the mean effects of the factor levels.
This last parametrization is used in the R function power.anova.test for the
calculation of the power function.

³ The non-central F is the distribution of the ratio of two independent $\chi^2$ random variables, where
the $\chi^2$ in the numerator is non-central.

Let us imagine that, before carrying out the weaving machine experiment, the
effect of the tension would be considered satisfactory if, by varying it, there is a
reduction of at least ten breaks. This means that, for example, the power has to be
evaluated for $\mu_1 = 35$, $\mu_2 = 25$ and $\mu_3 = 15$. We also assume that we have a priori
information that $\sigma^2$ could be about 60. We consider $m = 3$, the number of levels,
as a fixed parameter. With $\alpha = 0.05$ and $n = 9$, the routine power.anova.test
gives the following output:


power.anova.test(groups=3, n=9, between.var=var(c(15,25,35)),
                 within.var=60, sig.level=0.05, power=NULL)

     Balanced one-way analysis of variance power calculation

         groups = 3
              n = 9
    between.var = 100
     within.var = 60
      sig.level = 0.05
          power = 0.9976002


where we see that the power of the ANOVA test for the chosen values of $\tau_i$, $\sigma^2$,
$n$ and $\alpha$ is 99.8%. With this function it is also possible to obtain the value of an
argument keeping the others fixed by assigning to it the NULL value. For example,
one can thus obtain the value of $n$ providing the desired power. Without any precise
information on $\sigma$, it is possible to conservatively evaluate $n$ for a range of $\sigma$ values,
choosing the value of $n$ corresponding to the maximum value of $\sigma$ compatible with
the experiment. Using values of $\sigma$ in {5, 6, ..., 15}, we execute the loop (note the n
request through power.anova.test(...)$n, and that within.var is the variance $\sigma^2$):

for (i in 5:15)
   print(power.anova.test(groups=3, n=NULL, between.var=100,
         within.var=i^2, sig.level=0.05, power=0.95)$n)

and interactively obtain the following values of n: 3.2, 4.0, 5.0, 6.1,
7.4, 8.8, 10.4, 12.2, 14.1, 16.2, 18.4. This example shows that
a high $n$ value could be necessary to have a good test power if the experimental
variability is high. In particular, with $n = 9$ and $\sigma = 15$, we would get a power of
66%, far from the 95% target.
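The power quoted above for $n = 9$ and $\sigma^2 = 60$ can be cross-checked directly from Eq. (7.70), using the non-central F distribution available in R through pf with the ncp argument. This is a sketch under the assumption, consistent with Eq. (7.71), that the non-centrality parameter is $n\sum_i \tau_i^2/\sigma^2$; the variable names are ours.

```r
m      <- 3                      # number of factor levels
n      <- 9                      # replicates per level
mu     <- c(35, 25, 15)          # expected values under H1
sigma2 <- 60                     # assumed error variance

tau    <- mu - mean(mu)                   # effects tau_i
lambda <- n * sum(tau^2) / sigma2         # non-centrality parameter
Fcrit  <- qf(1 - 0.05, m - 1, m * n - m)  # F_{1-alpha}(m-1, mn-m)

# Eq. (7.70): probability that the non-central F exceeds the critical value
power_direct <- pf(Fcrit, m - 1, m * n - m, ncp = lambda, lower.tail = FALSE)
power_R <- power.anova.test(groups = m, n = n, between.var = var(mu),
                            within.var = sigma2, sig.level = 0.05)$power
print(c(direct = power_direct, power.anova.test = power_R))
```

The same lines, with sigma2 set to 15^2, give the power for the less favourable case discussed at the end of the paragraph.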


7.10 Two-Way ANOVA 


Now let us move on to the two-way ANOVA. Previously, we stated that experiments
are often scheduled to modify more than one factor, and our textile example is
no exception. In fact, the data in Table 7.5 are for a single type of wool, type B,
but the same tests with the three tension levels were also repeated with a second type
of wool, type A. The warpbreaks object in the base distribution of R contains
the complete data, as a data frame, as also shown by the output of the command
class(warpbreaks).

The total number of tests is then 54, and the experimental plan is balanced,
as displayed in the output of the function xtabs(~ wool + tension,
data = warpbreaks) that is given in Table 7.9.

Table 7.9 Table of the experimental plan that produced the object warpbreaks of the R
library: there are nine replicates for each combination of wool and tension

            Tension
Wool      L    M    H
A         9    9    9
B         9    9    9

This indicates that a factorial plan has been adopted, that is, all possible combinations of
the levels of the two factors were considered, and the whole plan was
replicated nine times. The data frame columns can be extracted with the commands
warpbreaks$tension, warpbreaks$wool, warpbreaks$breaks;
this last vector contains all the numerical data for the number of breaks.
Reasoning in a similar way as we did for Eq. (7.56), we separate the effects of the
factors from the statistical ones in order to obtain an algebraic identity, in which
we try to identify the general mean, the difference that takes into account the effect
of the level of the first factor, that of the second and the statistical fluctuation. This
time we have to use three indices: $i$ for the wool, $j$ for the tension and $k$ for each
test performed on the level combination $ij$. Associating to (A, B) the indices (1, 2)
and to (L, M, H) the indices (1, 2, 3), we can write, tentatively:


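$$y_{ijk} = \bar{y}_{\cdot\cdot\cdot} + (\bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot}) + (\bar{y}_{\cdot j\cdot} - \bar{y}_{\cdot\cdot\cdot}) + (y_{ijk} - \bar{y}_{ij\cdot}), \tag{7.72}$$

but we can easily realize that the identity is not valid and it has to be modified as:

$$y_{ijk} = \bar{y}_{\cdot\cdot\cdot} + (\bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot}) + (\bar{y}_{\cdot j\cdot} - \bar{y}_{\cdot\cdot\cdot}) + (\bar{y}_{ij\cdot} - \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot j\cdot} + \bar{y}_{\cdot\cdot\cdot}) + (y_{ijk} - \bar{y}_{ij\cdot}). \tag{7.73}$$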


The term added in the last formula, $(\bar{y}_{ij\cdot} - \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot j\cdot} + \bar{y}_{\cdot\cdot\cdot})$, measures the
difference between the effect of applying the $j$-th level of tension when the wool is at the
$i$-th level and the effect of applying the $j$-th tension level averaged over the wool levels.
This is an interaction
effect between the two factors: if the effect of the tension is the same for each
type of wool, then its value at the wool level $i$ is equal to the average value and
the interaction is absent (apart from statistical fluctuations). With the command
with(warpbreaks, interaction.plot(x.factor=tension, trace.factor=wool,
response=breaks, type="b")), we obtain the trend of the average of the
responses to the variation of the tension, for each wool type, that is displayed in
Fig. 7.10. We can easily notice that, with type A wool, the effect of the increase in
tension is already present at the medium level, while with type B wool, it shows
up only at the high level (as confirmed by the Tukey's test performed earlier). In
the case of no interaction, the dashed and solid lines would be roughly parallel.

Fig. 7.10 Interaction plot between tension and wool type obtained from the R warpbreaks dataset. Horizontal axis: tension (L, M, H); one line for each wool type (A, B).

As in the one-way ANOVA, we establish a hypothesis test on the effect of the factors
starting from a population model and from Eq. (7.73):

$$Y_{ijk} = \mu + \beta_i + \tau_j + (\beta\tau)_{ij} + \varepsilon_{ijk}, \qquad i = 1,\dots,a;\ j = 1,\dots,b;\ k = 1,\dots,n, \tag{7.74}$$


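where the $\varepsilon_{ijk}$ are i.i.d. $N(0, \sigma^2)$. The notation $(\beta\tau)$ is not the product between $\beta$ and $\tau$
but indicates the interaction parameter between the two factors. In addition to the
constraints for the mean effect of the factors, $\sum_i \beta_i = 0$ and $\sum_j \tau_j = 0$, we add
those for the interaction parameters, $\sum_i (\beta\tau)_{ij} = \sum_j (\beta\tau)_{ij} = 0$, as suggested by
the identities $\sum_i (\bar{y}_{ij\cdot} - \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot j\cdot} + \bar{y}_{\cdot\cdot\cdot}) = 0$, for $j = 1,\dots,b$, and
$\sum_j (\bar{y}_{ij\cdot} - \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot j\cdot} + \bar{y}_{\cdot\cdot\cdot}) = 0$, for $i = 1,\dots,a$. The hypotheses to be verified are therefore
whether there is an effect of the first factor (the type of wool), of the second factor
(the tension) and of a possible interaction:

$$H_0: \beta_1 = \dots = \beta_a = 0 \quad \text{against} \quad H_1: \exists\, i \text{ such that } \beta_i \neq 0 \tag{7.75}$$
$$H_0: \tau_1 = \dots = \tau_b = 0 \quad \text{against} \quad H_1: \exists\, j \text{ such that } \tau_j \neq 0 \tag{7.76}$$
$$H_0: (\beta\tau)_{ij} = 0,\ \forall (i,j) \quad \text{against} \quad H_1: \exists\, (i,j) \text{ such that } (\beta\tau)_{ij} \neq 0 \tag{7.77}$$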
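The zero-sum identities that motivate the constraints on the interaction parameters can be verified numerically on the warpbreaks data. The following lines are a minimal sketch with variable names of our choice; they use the untransformed number of breaks, but the identities hold for any response.

```r
y   <- warpbreaks$breaks
g   <- mean(y)                                    # general mean
mi  <- tapply(y, warpbreaks$wool, mean)           # wool means (index i)
mj  <- tapply(y, warpbreaks$tension, mean)        # tension means (index j)
mij <- tapply(y, list(warpbreaks$wool, warpbreaks$tension), mean)  # cell means

# estimated interaction terms: ybar_ij. - ybar_i.. - ybar_.j. + ybar_...
inter <- mij - outer(as.vector(mi), as.vector(mj), "+") + g
print(round(inter, 3))
print(colSums(inter))   # sum over i, for each tension level j
print(rowSums(inter))   # sum over j, for each wool type i
```

Both sets of sums should vanish up to rounding errors, whatever the data.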


As we did in Eq. (7.59), after some tedious algebraic calculations, we obtain the
decomposition of the total sum of squares $SS_T$ of the response by separating the
components due to the different sources of variation, with which we can perform
the tests of our interest:

$$\sum_{ijk} (y_{ijk} - \bar{y}_{\cdot\cdot\cdot})^2 = bn \sum_i (\bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot})^2 + an \sum_j (\bar{y}_{\cdot j\cdot} - \bar{y}_{\cdot\cdot\cdot})^2 + n \sum_{ij} (\bar{y}_{ij\cdot} - \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot j\cdot} + \bar{y}_{\cdot\cdot\cdot})^2 + \sum_{ijk} (y_{ijk} - \bar{y}_{ij\cdot})^2$$
$$= SS_{Tr1} + SS_{Tr2} + SS_{int} + SS_E. \tag{7.78}$$
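The decomposition (7.78) can be reproduced term by term in R and compared with the sums of squares returned by aov. The sketch below uses the untransformed breaks of warpbreaks and variable names of our choice; since the plan is balanced, the four terms are expected to coincide with the "Sum Sq" column of the ANOVA table.

```r
y <- warpbreaks$breaks
A <- warpbreaks$wool                      # first factor  (index i)
B <- warpbreaks$tension                   # second factor (index j)
a <- nlevels(A); b <- nlevels(B); n <- length(y) / (a * b)

g    <- mean(y)                           # general mean
mi   <- as.vector(tapply(y, A, mean))     # means for each wool type
mj   <- as.vector(tapply(y, B, mean))     # means for each tension level
mij  <- tapply(y, list(A, B), mean)       # a x b matrix of cell means
cell <- mij[cbind(as.integer(A), as.integer(B))]  # cell mean of each observation

SS <- c(Tr1 = b * n * sum((mi - g)^2),
        Tr2 = a * n * sum((mj - g)^2),
        int = n * sum((mij - outer(mi, mj, "+") + g)^2),
        E   = sum((y - cell)^2))
SS_T <- sum((y - g)^2)
print(SS)
print(c(total = SS_T, sum_of_parts = sum(SS)))

# the same quantities from the built-in ANOVA table
SS_aov <- anova(aov(breaks ~ wool * tension, data = warpbreaks))[["Sum Sq"]]
print(SS_aov)
```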


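The useful property of this experimental plan is that we can verify each of the three
null hypotheses of Eqs. (7.75-7.77) independently of the others. Let us take, for example,
$SS_{Tr1}$: substituting $y_{ijk}$ with the right-hand side of Eq. (7.74) and using the zero-sum
constraints of the parameters, we have:

$$\bar{Y}_{i\cdot\cdot} = \frac{1}{bn} \sum_{jk} \left( \mu + \beta_i + \tau_j + (\beta\tau)_{ij} + \varepsilon_{ijk} \right) = \frac{1}{bn} \sum_{jk} \left( \mu + \beta_i + \varepsilon_{ijk} \right). \tag{7.79}$$

Therefore, $\bar{Y}_{i\cdot\cdot} \sim N(\mu + \beta_i, \sigma^2/bn)$, independently of the presence of the other
effects. Then, taking into account that $\bar{y}_{\cdot\cdot\cdot}$ is the mean of the $\bar{y}_{i\cdot\cdot}$, one has:

$$MS_{Tr1} = \frac{SS_{Tr1}}{a-1} = \frac{bn}{a-1} \sum_i (\bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot})^2 \sim \sigma^2 \chi^2_R(a-1), \tag{7.80}$$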


under the hypothesis (7.75). With a similar reasoning, we can derive, under $H_0$, the
distributions of the other mean squares related to the effects of the factors
and, as in the one-way ANOVA, verify that the distribution of $MS_E$ is always the
same under any of the assumptions (7.75-7.77), that is, $MS_E \sim \sigma^2 \chi^2_R(ab(n-1))$.
All the obtained results are collected in Table 7.10.

Table 7.10 General table for the two-way ANOVA analysis. Under the $H_0$ hypotheses (7.75-7.77),
$MS_{Tr1}$, $MS_{Tr2}$ and $MS_{int}$ are distributed as $\sigma^2\chi^2_R(df)$. $MS_E$ is always distributed as $\sigma^2\chi^2_R(df)$,
independently of any hypothesis. The test p-value on each factor or on the interaction is the
probability that an F variable with the corresponding degrees of freedom is greater than the observed F value

Source of variation   df            Sum of squares   Mean of squares                     F
Treatment 1           a-1           SS_Tr1           MS_Tr1 = SS_Tr1/(a-1)               MS_Tr1/MS_E
Treatment 2           b-1           SS_Tr2           MS_Tr2 = SS_Tr2/(b-1)               MS_Tr2/MS_E
Interaction           (a-1)(b-1)    SS_int           MS_int = SS_int/((a-1)(b-1))        MS_int/MS_E
Residuals             ab(n-1)       SS_E             MS_E = SS_E/(ab(n-1))
Total                 abn-1         SS_T

Table 7.11 ANOVA table: check of Eqs. (7.75-7.77) with the data of the R library warpbreaks

                Df   Sum Sq   Mean Sq   F value    Pr(>F)
Wool             1     2.90     2.902     3.022   0.088542
Tension          2    15.89     7.946     8.275   0.000817
Tension:wool     2     7.20     3.601     3.750   0.030674
Residuals       48    46.09     0.960

Before calculating the ANOVA table with the data of warpbreaks, let us verify
the hypotheses of variance equality and of Gaussian errors. The groups of
observations that should have the same variance are in this case the six groups
identified by the combinations of the two factors as in Table 7.9. To use the
routine bartlett.test with two classification criteria, we create a factor
indicating group membership and then proceed to the test:


# character vector with the pairs AL, AM, AH,...
combo = paste(warpbreaks$wool, warpbreaks$tension, sep="")
# Bartlett's test with the groups identified by combo
bartlett.test(warpbreaks$breaks, combo)


obtaining a p-value of 0.023. The tests with type A wool have a different
dispersion than those with type B wool. By using again the square root
transformation for the break numbers, Bartlett's test instead has a p-value
of 0.289, so let us continue the analysis on the transformed data: $y'_{ijk} = \sqrt{y_{ijk}}$.
In the call to the aov function, we must give the information that we are
considering two factors and their interaction, so we will execute fit =
aov(sqrt(breaks) ~ wool*tension, data=warpbreaks), where *
indicates that the potential interaction needs to be taken into account. The Q-Q
plot obtained as before with qqnorm(rstandard(fit)) is in Fig. 7.11
and indicates that the residuals have a distribution with tails slightly different
from the Gaussian ones, but, since the violation is not gross, we continue with the
analysis. The ANOVA table obtained with summary(fit) is reproduced in
Table 7.11. From these data we conclude that the tension effect is highly significant,
whereas those of the wool type and of the interaction are marginally significant.
Also in this case, we can run Tukey's test to verify which pairs of tension
levels are responsible for the significance. The results obtained with
TukeyHSD(fit, which="tension") are shown in Table 7.12. If we disregard the type
of wool, we conclude that, according to the p-values of Table 7.12, there is
strong evidence of an effect of the tension when passing from L to H,
while the different behaviour of the two types of wool in passing from L to M and
from M to H determines respectively a marginally significant and an insignificant
p-value (see again Fig. 7.10 for a cross-check).
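The steps of this analysis can be collected in a short script. It is a sketch that follows the commands quoted in the text, with one difference that we flag explicitly: the grouping factor is built with interaction() instead of paste(), which defines the same six groups. The names p_raw, p_sqrt and tab are ours.

```r
# six groups given by the wool x tension combinations
combo <- interaction(warpbreaks$wool, warpbreaks$tension)

# Bartlett's test on the raw and on the square-root transformed breaks
p_raw  <- bartlett.test(warpbreaks$breaks, combo)$p.value
p_sqrt <- bartlett.test(sqrt(warpbreaks$breaks), combo)$p.value
print(c(raw = p_raw, sqrt = p_sqrt))

# two-way ANOVA with interaction on the transformed response
fit <- aov(sqrt(breaks) ~ wool * tension, data = warpbreaks)
tab <- summary(fit)[[1]]          # ANOVA table as a data frame
print(tab)

# Tukey's test on the tension levels only
print(TukeyHSD(fit, which = "tension"))
```

The printed p-values and ANOVA table can be compared with the values quoted in the text and in Tables 7.11 and 7.12.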

Fig. 7.11 Gaussian Q-Q plot (Normal Q-Q Plot) of the standardized residuals of the two-way ANOVA using the R library data warpbreaks. Axes: Theoretical Quantiles (x), Sample Quantiles (y).

Table 7.12 Tukey's test for the difference between expected number of breaks for pairs of tension
levels, with the model (7.74) applied to the R library data warpbreaks

               upr       p adj
M-L    -0.04060713   0.0373236
H-L    -0.52361812   0.0005874
H-M     0.30694285   0.3099954

We conclude this section on ANOVA with a remark on $MS_E$. From the comparison
of Eq. (7.59) with Eq. (7.78), we can notice that $SS_{Tr}$ and $SS_{Tr2}$ are both
constructed using the sum of squares of the deviations of the mean responses for each
tension level with respect to the general mean. $MS_E$ is instead evaluated in a different
way and, when systematic sources of variation are present and taken into account
by the interaction term, collects a smaller residual variation and therefore produces
a lower estimate of $\sigma^2$. We can check this property by performing a fit without
interaction, using the sign + instead of *, with fit = aov(sqrt(breaks) ~ wool +
tension, data=warpbreaks) and summary(fit). The result
is reported in Table 7.13, where the obtained values coincide with the ones of
Table 7.11 except for those depending on the $MS_E$ value.

Table 7.13 ANOVA table for the check of Eqs. (7.75-7.76) without interaction effect with the R
library data warpbreaks

             Sum Sq   Mean Sq   F value    Pr(>F)
Wool           2.90     2.902     2.723   0.10520
Tension       15.89     7.946     7.455   0.00147
Residuals     53.29     1.066
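The remark on the residual mean square can be checked by fitting the two models side by side. In this sketch the object names fit_int and fit_add are ours; anova() applied to the two nested fits gives the F-test for the term dropped from the larger model, here the interaction.

```r
fit_int <- aov(sqrt(breaks) ~ wool * tension, data = warpbreaks)  # with interaction
fit_add <- aov(sqrt(breaks) ~ wool + tension, data = warpbreaks)  # without interaction

# residual mean squares MS_E of the two models
MSE_int <- deviance(fit_int) / df.residual(fit_int)
MSE_add <- deviance(fit_add) / df.residual(fit_add)
print(c(with_interaction = MSE_int, without_interaction = MSE_add))

# F-test comparing the nested models
print(anova(fit_add, fit_int))
```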




7.11 Problems 


7.1 Two machines produce steel shafts. The average diameter values of two
samples with ten shafts each are $\mu_1 \in 5.36 \pm 0.05$ and $\mu_2 \in 5.21 \pm 0.05$ mm.
Apply Student's test to verify the homogeneity of the pieces produced by the two
machines.


7.2 One group of 70 sick people was given a drug, and another group of 58 sick
people was given simple sugar (placebo). Evaluate whether the drug is effective based
on the following contingency table:


Drug Placebo Total 


Healed 40 28 68 
Sick 30 30 60 
Total 70 58 128 


7.3 The number X of buses arriving at a toll booth in 5-min intervals has been
counted N = 100 times. The result of this measurement is given in the following
histogram, with the discrete X spectrum divided into 11 channels (from 0 to 10):


No. of busses xj 0 3 4 5 67 
1 8 


1 2 8 9 10 
No. of trials n; 5 6 19 20 17 #15 8 1 O 
Perform a two-tailed test to check whether the data are consistent with a Poisson
process.


7.4 According to a genetic model, a certain tree should provide a green to yellow 
pea ratio of 3:1. In a sample of 500, 356 green peas were found. Say whether the 
model can be accepted at the 5% level. 


7.5 A series of 600 rolls with 2 different dice gave the following contingency table 
for the 6 sides: 


            1     2     3     4     5     6   TOTAL
DIE 1     101   105   103    95    90   106     600
DIE 2      99   105    98   103    96    99     600
TOTAL     200   210   201   198   186   205    1200


Check whether (a) the two series of rolls are compatible with each other and (b) dice 
and rolls are fair. 




7.6 After the introduction of a new highway speed limit, car accidents during
weekends decreased from 60 to 33. What is the probability of being wrong when
affirming that the decrease is due to the new limit?


7.7 A set of cosmic ray detectors collected the following counts in the same time 
interval: 


Detector    1    2    3    4    5    6
Counts     29   19   18   25   17   10


Check whether the flux on the counters is homogeneous at a 5% test level. 


7.8 A measurement of the emission rates of two radioactive sources A and B gave 
the following result: 


A = 240 counts in 10 s
B = 670 counts in 10 s


One experimenter claims to have obtained the result

C = 10500 counts in 100 s

in a new measurement with the A and B sources combined. Evaluate if this statement
is correct.


7.9 The following 47 split times refer to the working intervals observed between
consecutive failures of an electronic device:


Interval     0-120 h   120-240 h   240-360 h   360-480 h
Frequency       22         12           7           6


Verify whether the data support the hypothesis of an exponential law with $\lambda = 0.005\ \mathrm{h}^{-1}$
at a test level of 1%.


7.10 During a counting experiment, 1000 arrival times are recorded over fixed time 
intervals. The final result is: 


Interval   [0-1] s   [1-2] s   [2-4] s   > 4 s
Trials       368       266       217      149


Verify whether the data are in agreement with a true frequency of $0.5\ \mathrm{s}^{-1}$.




7.11 Four measurements of a given physical quantity give the following result: 
1.12 1.13 1.10 1.09. 


Verify the compatibility among the measurements assuming that they come from
normal densities with the same variance $\sigma^2 = 4 \cdot 10^{-4}$.


7.12 The law sets a limit on the concentration of a certain air pollutant to 55 parts
per million (ppm). A series of ten repeated measurements gave an average value of
58 ppm. Knowing that the uncertainty of a single measurement (standard deviation)
is 10 ppm, check whether the data exceed the allowable limit at a test level of 5%.


7.13 An experiment gives the following result: 
13 values < -1,   25 values in [-1, 0],   44 values in (0, 1],   16 values > 1.
Check whether the data are in agreement with the standard normal N(0, 1). 


7.14 In six specimens of a 100-meter-long high-voltage cable,

18, 14, 10, 10, 21, 17

defective points of insulation have been detected.
Evaluate whether these data can be considered in agreement with a standard of less
than 15 defects per 100 meters.


7.15 After the administration of a drug, 15 parameters related to the health of the
tested subjects were followed and compared with a control group. The 15 p-values
from the t-tests are, in ascending order, as follows [BH95]:


0.0001, 0.0004, 0.0019, 0.0095, 0.0201, 0.0278, 0.0298, 0.0344,
0.0459, 0.3240, 0.4262, 0.5719, 0.6528, 0.7590, 1.000.


Use the routine MultiTest to find which parameters are sensitive to the drug, at 
a test level of 5% and 1%. 


7.16 Perform the t-test on the data on the breaks in the warp depending on the 
tension of the loom given in Table 7.5. 


7.17 Calculate the p-values of Table 7.8 with the R routine ptukey. 




7.18 Two creams A and B and a placebo P were given to 25 patients with blisters. 
The days needed for the healing are shown in the table: 


Treatment Days 


A     5  6  6  7  7  8  9  10
B     7  7  8  8  9  10  10  11
P     7  9  9  9  10  10  11  12  13


Make the ANOVA analysis of the data. 


Chapter 8
Monte Carlo Methods

It is the powerful development and intensive use of the 
simulative function that, in my view, characterizes the unique 
properties of man’s brain. And this is at the most basic level of 
the cognitive functions, those on which language rests and 
which it probably reveals only incompletely. 


Jacques Monod, “CHANCE AND NECESSITY: ESSAY ON THE 
NATURAL PHILOSOPHY OF MODERN BIOLOGY”. 


8.1 Introduction 


The term “Monte Carlo methods” or “MC methods” generally refers to all those 
techniques that make use of artificial (i.e. computer generated) random variables to 
solve mathematical problems using random samples drawn from the corresponding 
populations. 

Undoubtedly, this is not a very efficient way to obtain the solution of a problem, 
as the (often) time-consuming simulated sampling procedure gives a result that is 
always affected by the statistical error. In practice, however, we are often faced with 
situations in which it is too difficult, if not impossible, to use the standard numerical 
or analytical procedures, and in all these cases, Monte Carlo methods become the 
only available alternative. 

The application of these methods is not limited only to purely statistical 
problems, as it might seem from the use of probability distributions, but includes 
all those cases in which a connection can be found between the problem under 
consideration and the behaviour of a certain random system: for example, the value 
of a definite integral, which is certainly not a random quantity, can also be calculated 
using random numbers. 
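As a minimal illustration of this last statement, the integral of $e^{-x^2}$ between 0 and 1 can be estimated as the sample mean of the integrand evaluated at uniform random points. The integrand, the number of points and the seed are arbitrary choices of ours; the exact value, written through the Gaussian cumulative function pnorm, is used only for comparison.

```r
set.seed(8)                      # arbitrary seed, for reproducibility only
N  <- 1e5                        # number of uniform random numbers
xi <- runif(N)                   # uniform variates in [0, 1]
g  <- exp(-xi^2)                 # integrand evaluated at the random points

estimate <- mean(g)              # Monte Carlo estimate of the integral
std_err  <- sd(g) / sqrt(N)      # statistical error of the estimate
exact    <- sqrt(pi) / 2 * (2 * pnorm(sqrt(2)) - 1)   # closed form via the Gaussian cdf
print(c(estimate = estimate, std_err = std_err, exact = exact))
```

The estimate should agree with the exact value within a few standard errors, and the error decreases only as $1/\sqrt{N}$.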

The theoretical foundations of the Monte Carlo methods (or simulation methods)
have been known for a long time, and the first example of the use of random
numbers for the evaluation of definite integrals is even found in a book (Essai
d'arithmétique morale) written in 1777 by Georges-Louis Leclerc, Comte de Buffon,
French mathematician and naturalist, in which a procedure for the approximate
calculation of the value of $\pi$ is outlined (see Problem 8.13).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
A. Rotondi et al., Probability, Statistics and Simulation, La Matematica per il 3+2
139, https://doi.org/10.1007/978-3-031-09429-3_8

For over a century and a half, however, the method was used only sporadically and above
all for didactic purposes. Its first systematic application took place only in the first
half of the 1940s in Los Alamos, by the team of scientists, led by Enrico Fermi,
who developed the project of the first atomic bombs. In this same period, the term
"Monte Carlo" was also born, which obviously refers to the town famous for its
casino and, more precisely, to roulette, one of the simplest mechanical devices that
can be used to generate random variables.

The authorship of the name is in particular attributed to the mathematicians J. von 
Neumann and S. Ulam, who adopted it as the code name of their secret research, 
conducted using random numbers, on the processes of diffusion and absorption of 
neutrons in fissile materials (for more historical information, see [HH64]). 

After 1950, these methods passed in a few years from the role of mathematical
curiosity to that of an indispensable tool for scientific research thanks to the advent
of computers. This has happened not only because computers provide a rapid
execution of the long calculations that are often necessary to obtain a meaningful
result but also, as we will see later, because they can easily generate random
numbers. Currently, there are applications in many different research fields, from
nuclear physics to chemistry, from statistical and quantum mechanics to economics.

In this chapter we give a description of the fundamental principles of these 
methods and the most important technical details necessary to create Monte Carlo 
codes for the solution of statistical problems. Other significant applications are 
explained later in Chap. 9. 


8.2 What Is Monte Carlo? 


Any computer library system includes service routines, generating numbers uni- 
formly distributed in [0,1], which we can consider random. As we will discuss 
shortly, this particular distribution is needed to perform any simulation. The 
realization of this type of routine, which may seem simple on the surface, actually 
constitutes a complex mathematical problem, as can be verified by consulting 
references [BS91, Cha75, Com91, FLJW92, Jam90, MNZ90, PFTW92], whose 
description goes beyond the scope of this book. In R, as we have already mentioned, 
the uniform generator is the routine: 


x = runif(n,min=0, max=1) 


that, by default, produces n values in [0, 1]. Numbers within any range can be 
generated by changing the values of min and max. For example, to generate a vector 
of 2000 numbers and make their histogram, just write: 


> hist(runif(2000)) ,


or you can use our HistoBar routine, which offers some options, such as the
calculation of error bars, and other possibilities that you can find commented and
described in the code:


> HistoBar(runif(2000), grid=TRUE, errors='ON')


To draw curves, one can also use the plot function and the density routine,
which is described in Appendix B:


> plot(density(runif(2000), adj=0.01)) ,


where the degree of smoothing is tuned by the adj parameter (an abbreviation of
the adjust argument of density).

The R library provides routines to generate random numbers extracted from all 
distributions of current use. As explained in Appendix B, the prefix r must be 
used before the name of the required density. All the routines that generate random 
numbers must be initialized with one or more integer numbers, called seeds. If seeds 
are not changed, at each new call of the program, the results (even if random) will 
always be the same. In other words, the same seed will always generate the same 
sample. In R, the seeds are automatically renewed at each simulation, taking the 
current value provided by the system software. If you want to fix the seed, which 
can sometimes be useful in the testing phase of a code, the instruction to be placed 
at the beginning of your program is the following: 


> set.seed(123432) ,


where the seed 123432 is arbitrary and must be changed when a different sequence
is requested. One can easily verify that by typing:


> set.seed(13567); runif(10)
> set.seed(13567); runif(10) ,


exactly the same sequence of ten random numbers is repeated twice. From now 
on: 


• With the symbol $\xi$ (with or without indexes), we will always denote a uniform
random variable in the interval [0, 1].

• With $\xi$ (with or without indexes), we will denote the values assumed by $\xi \sim U(0, 1)$
in one or more specific trials. In practice, these are the uniform variates
or numerical values supplied by the runif routine.
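Both the role of the seed and the relation between runif with generic limits and the variates in [0, 1] can be checked with a few lines. This is a sketch with arbitrary limits (2 and 5) and variable names of our choice; it relies on the fact that runif obtains a variate in [min, max] by rescaling a uniform variate in [0, 1].

```r
# same seed, same sequence of variates
set.seed(13567); a <- runif(10)
set.seed(13567); b <- runif(10)
print(identical(a, b))

# a uniform variate in [2, 5] is a rescaled uniform variate in [0, 1]
set.seed(13567); u <- runif(10, min = 2, max = 5)
print(all.equal(u, 2 + (5 - 2) * a))
```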


To simulate the simplest experiment, the flip of a coin, just divide the unit
interval in half and, for each generated number $\xi$, define the event as “head” when
$0 \le \xi < 0.5$ and as “tail” when $0.5 \le \xi \le 1$ (of course, the inverse convention
would work just as well). The R lines, contained in our routine MCcoin, are:


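# simulation of nsim tosses of 10 coins
heads <- seq(0,0,length.out=nsim)     # one counter per simulated toss
for (j in 1:nsim) {
   x <- runif(10)                     # ten uniform variates, one per coin
   heads[j] = length(which(x<0.5))    # number of heads in this toss
}

HistoBar(heads, nbins=11, minx=0, maxx=11, errors='ON')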




Fig. 8.1 Number of heads obtained in 100 simulated flips of 10 coins (mean = 4.83 ± 0.17, std dev. = 1.74 ± 0.12). Compare the result with that of the real experiment in Fig. 7.6


Notice the use of the R function which, which extracts from the vector x the
positions of the values < 0.5; the length of the resulting vector is the
number of heads obtained, which is stored in heads. At the end of the loop, if
we denote by $n_i$ the contents of a generic bin of the histogram and by $N$ the
number of simulated tosses, the ratio $n_i/N$ gives us, for each bin, the estimate
of the probability. The histogram obtained from the HistoBar routine is shown in
Fig. 8.1. It should be noted that, by simulating the flipping of ten coins, we do
nothing but generate a random variable from the binomial distribution (2.29), and,
for $N \to \infty$, the frequency histogram we obtain tends to $b(x; 10, 0.5)$. You can
easily verify this, and obtain again a histogram equivalent to that of Fig. 8.1,
with the instructions:


> x <- rbinom(100, size=10, prob=0.5)
> HistoBar(x, min=0, maxx=10, nbins=11, errors='ON')


where the R routine rbinom has been used to generate binomial variables. Since, 
for obvious reasons, we can repeat this algorithm only for a finite number of times, 
the simulation gives frequencies affected by a certain error, as evidenced by the 
error bars displayed in Fig. 8.1 and calculated with the procedure of Sect. 6.14. Of 
course the accuracy of our estimate increases with N, but the improvement we 
get is not very high: the histogram error bars, calculated with Eq. (6.106), show 
that, by quadrupling the number of events, we only halve the error and, as we 
will demonstrate shortly, this trend is not related to this particular example but is 
a general characteristic of all MC calculations. 
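
The convergence of the simulated frequencies towards $b(x; 10, 0.5)$ can also be checked in base R, without our HistoBar routine. The lines below are only a minimal sketch: the sample size nsim = 10000, the seed and the variable names (freq, theory, err) are arbitrary choices made for this illustration.

```r
set.seed(2024)                                   # arbitrary seed, for reproducibility only
nsim  <- 10000                                   # number of simulated tosses of 10 coins
heads <- replicate(nsim, sum(runif(10) < 0.5))   # number of heads in each toss

# observed frequencies n_i/N for the scores 0, 1, ..., 10
freq   <- as.numeric(table(factor(heads, levels = 0:10))) / nsim
theory <- dbinom(0:10, size = 10, prob = 0.5)    # binomial probabilities b(x; 10, 0.5)

# binomial error of each frequency, sqrt(p(1-p)/N), evaluated at the true p
err <- sqrt(theory * (1 - theory) / nsim)
print(round(cbind(score = 0:10, freq, theory, err), 4))
```

The differences between the columns freq and theory should be of the order of the values in the column err.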

In the next section, we will give a theoretical justification of MC methods, 
rewriting the rather qualitative indications we have developed so far in a more 
general and mathematically correct way. 


8.3 Mathematical Aspects


We can view a variable T associated with any stochastic phenomenon as a function
of k random variables $(X_1, X_2, \ldots, X_k)$:


$$T = f(X_1, X_2, \ldots, X_k). \tag{8.1}$$


As we explained in the previous chapters, the whole process is, ultimately,
characterized by the mean value and the dispersion parameters of T; the latter are
in turn expressible, on the basis of Eq. (2.67), as the difference of mean values. The
phenomenon under consideration can therefore always be described on the basis of
the expected values (2.68), through the solution of one or more integrals (or sums)
of the type:


$$I = \langle T \rangle = \int_D f(x_1, x_2, \ldots, x_k)\, p(x_1, x_2, \ldots, x_k)\, dx_1 \cdots dx_k, \tag{8.2}$$


where $p(x_1, x_2, \ldots, x_k)$ is the p.d.f. of the variables $(X_1, X_2, \ldots, X_k)$, defined in
$D \subseteq \mathbb{R}^k$ and with:


$$\int_D p(x_1, x_2, \ldots, x_k)\, dx_1 \cdots dx_k = 1. \tag{8.3}$$


From a strictly formal point of view, all MC calculations, which are simulations of 
stochastic processes in which events are randomly generated according to specified 
probability distributions, are equivalent to the approximation of the value of a 
definite integral or sum. 

If we generate N values of the random variable T, using several independent sets 
of random numbers extracted from the density p, we know, from the properties of 
the mean of a sample (see Sects. 2.11 and 6.9), that the quantity: 


$$T_N = \frac{1}{N} \sum_{i=1}^{N} f(x_{1i}, x_{2i}, \ldots, x_{ki}), \tag{8.4}$$


is a correct and unbiased estimate of $I$.

If we assume that the distribution of the random variable T, as is almost
always verified in practice, has a finite variance $\sigma_T^2$, by applying Chebyshev's
inequality (3.92) and the formula (6.49) for the variance of the mean, we obtain the
relation:


$$P\left\{ |T_N - I| \ge \frac{1}{\sqrt{N \varepsilon}}\, \sigma_T \right\} \le \varepsilon. \tag{8.5}$$




Another very useful property (always assuming a finite $\sigma_T$) is provided by the
Central Limit Theorem 3.1, according to which the p.d.f. of $T_N$ tends “asymptotically”
to a Gaussian with mean $I$ and standard deviation $\sigma_T/\sqrt{N}$. As we have
already noticed, the asymptotic requirement is actually satisfied already for rather
low values of N ($N \ge 10$); so we can almost always write that:


$$P\left\{ |T_N - I| \le \frac{3\, \sigma_T}{\sqrt{N}} \right\} \simeq 0.997. \tag{8.6}$$


Equations (8.5, 8.6) show that the simulated values converge in probability towards
the quantity to estimate $I$ since their variance is $\sigma_T^2/N$. It therefore turns out that
(except in the cases in which, as we will mention later, it is possible to “manipulate”
T by reducing its variance) the only way to increase the precision of the $T_N$ estimate
is to increase the number N of simulated events. This slow convergence of MC
estimators can require considerable computational time in the simulation of very
complex systems.
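
The $1/\sqrt{N}$ behaviour of the error can be seen directly with a few lines of base R. This is only an illustrative sketch under our own assumptions: the function $f(\xi) = \xi^2$ (for which $I = 1/3$ and $\sigma_T^2 = 4/45$), the seed and the name mc_estimate are hypothetical choices, not part of the routines of the book.

```r
set.seed(11)                         # arbitrary seed
f <- function(x) x^2                 # T = f(xi), with exact mean I = 1/3

mc_estimate <- function(N) {
  t <- f(runif(N))                   # N simulated values of T
  c(TN = mean(t),                    # the estimate T_N of Eq. (8.4)
    err = sd(t) / sqrt(N))           # its standard error, sigma_T/sqrt(N)
}

# each step multiplies N by four
res <- sapply(c(1000, 4000, 16000, 64000), mc_estimate)
colnames(res) <- c("N=1000", "N=4000", "N=16000", "N=64000")
print(res)
```

Each fourfold increase of N should roughly halve the value in the row err, while the row TN stays compatible with 1/3 within a few standard errors.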


8.4 Generation of Discrete Random Variables 


If the density function of the discrete variable to be generated is known, then,
bearing in mind that:


• It is possible to construct, with Eq. (2.27), the cumulative function.
• Equations (3.87, 3.89) hold.
• The runif routine is available.


it is easy to realize that the most efficient method for simulating discrete
variables consists in extracting a random variate $0 \le \xi \le 1$, considering it as
a cumulative value and determining the quantile corresponding to the
extracted cumulative value.

Fig. 8.2 Generation of discrete random variables. The unit interval is divided into k segments
of length $p_1, p_2, \ldots, p_k$, and the subinterval where a random number $\xi$ is located identifies the
generated event


Indeed, let us consider a segment of unit length, divide it into k intervals and then
assign to each of them a length $p_i$ equal to the probability for the corresponding
event $\{X = x_i\}$ to occur (see Fig. 8.2). Since the probability for a uniform variate
$0 \le \xi \le 1$ to fall within a particular interval is exactly equal to the length of that
interval:

$$P\{0 \le \xi < p_1\} = p_1, \tag{8.7}$$
$$P\{p_1 \le \xi < p_1 + p_2\} = p_2, \tag{8.8}$$
$$\ldots$$
$$P\{p_1 + p_2 + \ldots + p_{k-1} \le \xi \le 1\} = p_k, \tag{8.9}$$


the value $x_j$ corresponding to the upper extreme $p_1 + p_2 + \ldots + p_j$ of the interval
which contains the random number $\xi$ from the runif routine must be considered
as extracted. The use of the cumulative is the basic method common to all MC
simulations. This procedure can be summarized as follows:


Algorithm 8.1 (Generation of Discrete Variables) To generate a discrete random
variable X, which can take a finite set of spectral values $x_1, x_2, \ldots, x_k$ with
probabilities $p_1, p_2, \ldots, p_k$, it is necessary:


• To evaluate the cumulative function $F_j$:


$$F_j = \sum_{i=1}^{j} p_i \qquad (j = 1, 2, \ldots, k) \tag{8.10}$$


• To generate a random variate $0 \le \xi \le 1$
• To determine the index j ($1 \le j \le k$) satisfying the inequality:


$$F_{j-1} \le \xi < F_j \tag{8.11}$$


(when j = 1, one defines $F_{j-1} = 0$)
• To set $\{X = x_j\}$


This algorithm, based on inequality (8.11), requires that the vector containing the 
cumulative data has zero in the first position and one in the last. Therefore, if the 
discrete values of the spectrum form a vector of dimension k, as in Eq. (8.10), the 
vector of the cumulative must have k + 1 values.
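
The steps of Algorithm 8.1 can be sketched in a few lines of R. The function name rdiscrete, its arguments and the test values below are hypothetical and serve only as an illustration; the search of the index is delegated to the R routine findInterval, which works on intervals closed on the left.

```r
# Sketch of Algorithm 8.1: n variates of a discrete variable with
# spectrum 'values' and probabilities 'probs' (hypothetical helper)
rdiscrete <- function(n, values, probs) {
  stopifnot(length(values) == length(probs), all(probs >= 0),
            abs(sum(probs) - 1) < 1e-9)
  k <- length(probs)
  cumul <- c(0, cumsum(probs))       # k + 1 values: zero first, one last
  cumul[k + 1] <- 1                  # protect against rounding errors
  xi <- runif(n)                     # uniform variates
  # index j with cumul[j] <= xi < cumul[j + 1], forced inside 1..k
  j <- findInterval(xi, cumul, all.inside = TRUE)
  values[j]
}

set.seed(1)                          # arbitrary seed
x <- rdiscrete(100000, values = c(-1, 0, 5), probs = c(0.2, 0.5, 0.3))
print(table(x) / length(x))          # frequencies to compare with probs
```

Since findInterval accepts a whole vector of variates, no explicit loop is needed in this sketch.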

To minimize the computational time needed to solve Eq. (8.11), it is necessary to
have an efficient routine to find the index j - 1 corresponding to the value extracted
from runif(1), that is, a search algorithm on the intervals “closed on the left”.
Obviously, the least time-efficient method is the sequential search, which is never
used by the routines present in statistical software.

The available algorithms mostly belong to the so-called binary (or dichotomic)
search family on which there is a very large literature [KR88, PFTW92]. The
computation starts from the subinterval corresponding to the central index of the
vector; if the target index is greater (smaller), the first (second) half of the vector
F is eliminated, and the search continues in the remaining half, again
starting from its middle index.




The number of comparisons needed to find this index is equal to the minimum
integer m verifying the inequality $2^m \ge k$ (if k = 1000, for example, m = 10).
Therefore, the search time goes as $t \sim k/2$ in the case of the sequential search and
as $t \sim \ln(k - 2)$ for the dichotomic one.

In R, it is possible to use different search strategies to implement Eq. (8.11). Since
this can be an important aspect for all those who perform simulations, let us analyse
the performance of two routines that scan a vector x sorted in ascending order to
determine the interval containing a value y: findInterval(y,x) and
max(0, which(y>x)). The former is perhaps the most used R routine for dichotomic
searches; the latter is a possible workaround: which finds all positions of the
vector x that have values < y, and then max selects the largest of these positions,
which identifies the interval because x is sorted in ascending order. The computation
time of the two methods can be evaluated with the system.time routine, using the
following code:


> k = 100
> x <- runif(k)
> x <- sort(x)
> n = 500000
> test <- function(n,x) for(j in 1:n) {y=runif(1); z=findInterval(y,x);}
> test1 <- function(n,x) for(j in 1:n) {y=runif(1); z=max(0,which(y>x));}
> system.time(test(n,x))
> system.time(test1(n,x))


With these values and on the PC we have used, findInterval takes 2.89 s,
whereas max-which takes 2.34 s. Obviously, these times increase linearly with
the number of comparisons n while, changing the number of positions k from 100
to 500, findInterval takes 3.32 s and max-which takes 3.67 s. The
time increment roughly follows the dichotomic search rule $t \sim \ln(k - 2)$ for both
algorithms, but we see that the pair of routines max-which seems to be a little
faster for low values of k and slightly slower for large values. Both solutions are
efficient, after all.
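
That the two strategies return the same index can be checked with the short sketch below, which also shows that findInterval accepts a whole vector of values in a single call. The sizes, the seed and the names z_fast and z_slow are arbitrary choices made for this example.

```r
set.seed(5)                          # arbitrary seed
k <- 100
x <- sort(runif(k))                  # vector sorted in ascending order
y <- runif(1000)                     # values to be located

# a single call of findInterval handles the whole vector y
z_fast <- findInterval(y, x)

# the max-which workaround, applied to one value at a time
z_slow <- sapply(y, function(v) max(0, which(v > x)))

print(all(z_fast == z_slow))         # compare the two sets of indices
```

The index 0 is returned by both methods when y is smaller than the first element of x.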


Exercise 8.1 
Write a code to simulate the rolling of a pair of fair dice. 


Answer We must first evaluate the probability distribution and the cumulative
function F associated with each score. If we denote by S the sum of
the points obtained in a single roll, these probabilities are given in
Table 8.1. To solve the problem, we use our code MCdices:

MCdices <- function(nsim=1000, grid=TRUE) {
   # cumulative function of the score, with zero in the first position
   cumul <- c(0., 1/36, 3/36, 6/36, 10/36, 15/36, 21/36,
              26/36, 30/36, 33/36, 35/36, 1.)
   scores <- seq(2, 12, length.out=11)     # possible scores
   tosses <- seq(0, 0, length.out=11)      # counters, one per score

   for (j in 1:nsim) {
      x = runif(1)
      ind = findInterval(x, cumul)         # index of the extracted score
      tosses[ind] = tosses[ind] + 1
   }
   meansc = MeanHisto(scores, tosses)
   stdsc = sqrt(VarHisto(scores, tosses))
   ermean = stdsc/sqrt(sum(tosses))
   erstd = stdsc/sqrt(2*sum(tosses))

   output <- paste("mean = ", round(meansc, digits=3), " +-",
                   round(ermean, digits=3),
                   " sigma = ", round(stdsc, digits=3), "+-",
                   round(erstd, digits=3))

   HistoBar(scores, tosses, errors='ON', grid=TRUE, xex=output)

   if (grid==TRUE) grid()
}


which produces the results of Fig. 8.3, relative to 20000 rolls. From
Eq. (8.6), we can estimate from the histogram the score mean value $\langle S \rangle$ of
a pair of dice as:


$$\langle S \rangle = 7.00 \pm 0.03, \qquad CL \simeq 99.7\%.$$
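
The same simulation can be sketched in base R only, without the MeanHisto, VarHisto and HistoBar routines. The lines below are an illustration under our own assumptions (seed and variable names are arbitrary): the cumulative of Table 8.1 is built with cumsum, and all the rolls are extracted with a single call of findInterval.

```r
set.seed(7)                                    # arbitrary seed
nsim   <- 20000                                # number of simulated rolls
probs  <- c(1:6, 5:1) / 36                     # probabilities p_i of Table 8.1
cumul  <- c(0, cumsum(probs))                  # 12 values, from zero to one
cumul[12] <- 1                                 # protect against rounding errors
scores <- 2:12

s <- scores[findInterval(runif(nsim), cumul)]  # Algorithm 8.1, vectorized

m   <- mean(s)
err <- sd(s) / sqrt(nsim)                      # standard error of the mean
cat("mean =", round(m, 3), "+-", round(3 * err, 3), "(three standard errors)\n")
```

According to Eq. (8.6), the interval printed by cat should contain the true value 7 in about 99.7% of the repetitions of this simulation.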


Table 8.1 Probability distribution $p_i$ ($i = 1, \ldots, 11$) and cumulative function $F_i$
($F_i = \sum_{k=1}^{i} p_k$) of the score $S_i$ resulting from the roll of a pair of dice

  i  | S_i | p_i  | F_i
  1  |  2  | 1/36 | 1/36
  2  |  3  | 2/36 | 3/36
  3  |  4  | 3/36 | 6/36
  4  |  5  | 4/36 | 10/36
  5  |  6  | 5/36 | 15/36
  6  |  7  | 6/36 | 21/36
  7  |  8  | 5/36 | 26/36
  8  |  9  | 4/36 | 30/36
  9  | 10  | 3/36 | 33/36
 10  | 11  | 2/36 | 35/36
 11  | 12  | 1/36 | 1


Fig. 8.3 Histogram of 20,000 simulated rolls of a pair of dice (horizontal axis: score; mean = 7.00 ± 0.01, std dev. = 2.42 ± 0.01)


8.5 Generation of Continuous Random Variables 


In the case of continuous variables, it is sufficient to invoke again Theorem 3.5 to 
immediately arrive at the same type of procedure that we have just used for discrete 
variables. 

Let us consider a generic p.d.f. p(x), having cumulative F(x) and defined on any
interval [a, b]. If we randomly generate a value $\xi$ and calculate the corresponding
value:


$$x = F^{-1}(\xi), \tag{8.12}$$


we obtain, from Eq. (3.87), a random generation of X according to p(x).

To generate random variables from any density, it is therefore sufficient to solve
an integral and invert the obtained function, using an algorithm that can be stated
as follows:


Algorithm 8.2 (Inverse Transformation) To generate a continuous random
variable X, distributed as p(x) and defined in the interval [a, b], it is necessary:


• To generate a variate $0 \le \xi \le 1$
• To solve, with respect to x, the equation:


$$\int_a^x p(t)\, dt = \xi \tag{8.13}$$


The following exercises help you to get familiar with this procedure, which is of
fundamental importance for the MC methods. We also suggest looking again at
Exercise 3.12.
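
As a quick illustration of Algorithm 8.2, consider the exponential density $p(x) = \lambda e^{-\lambda x}$ ($x \ge 0$), chosen here only because its cumulative $F(x) = 1 - e^{-\lambda x}$ is easily inverted: Eq. (8.13) gives $x = -\ln(1 - \xi)/\lambda$. The following sketch uses hypothetical names (rexp_inv, lambda) and an arbitrary seed.

```r
set.seed(3)                          # arbitrary seed

# inverse transformation for the exponential density (hypothetical helper)
rexp_inv <- function(n, lambda) {
  xi <- runif(n)                     # uniform variates
  -log(1 - xi) / lambda              # solution of F(x) = xi
}

x <- rexp_inv(50000, lambda = 2)
print(c(mean = mean(x), sd = sd(x)))   # both to be compared with 1/lambda = 0.5
```

In practice the R routine rexp already provides this generator; the sketch only makes the inversion step explicit.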


Exercise 8.2 
Generate a uniform random variable X within the interval [a,b]. 


Answer From Eq. (3.79), one has:


$$p(x) = \frac{1}{b - a} \tag{8.14}$$

and Eq. (8.13) becomes:

$$\int_a^x \frac{dt}{b - a} = \frac{x - a}{b - a} = \xi. \tag{8.15}$$

Hence, the final result is:

$$x = a + \xi\,(b - a). \tag{8.16}$$


In R, Eq. (8.16) can be implemented with the calling string: x = runif(1,
min=a, max=b).


Exercise 8.3
Randomly generate points uniformly distributed in a circle of radius R with
constant density $\rho$ (points/cm$^2$).


Answer To define the position of a generic point P within a circle, it is
convenient to use the pair of polar coordinates $r$ and $\varphi$ (with $0 \le r \le R$
and $0 \le \varphi \le 2\pi$).

To determine their probability densities $p(\varphi)$ and $q(r)$, we first observe
that $p(\varphi)\, d\varphi$ is given by the ratio between the number of points contained
in the infinitesimal dashed sector of Fig. 8.4a and the total number of points
contained in the circle surface:


$$p(\varphi)\, d\varphi = \frac{\rho R^2\, d\varphi / 2}{\rho \pi R^2} = \frac{d\varphi}{2\pi}. \tag{8.17}$$


Analogously, for $q(r)$, we obtain (see Fig. 8.4b):


$$q(r)\, dr = \frac{\rho\, 2\pi r\, dr}{\rho \pi R^2} = \frac{2r}{R^2}\, dr. \tag{8.18}$$


The corresponding cumulative functions are:


$$P(\varphi) = \int_0^{\varphi} p(\varphi')\, d\varphi' = \frac{\varphi}{2\pi}, \tag{8.19}$$

$$Q(r) = \int_0^{r} q(r')\, dr' = \frac{r^2}{R^2}. \tag{8.20}$$


Applying again Algorithm 8.2, we obtain:


$$\varphi = 2\pi\, \xi_1, \qquad r = R \sqrt{\xi_2}, \tag{8.21}$$


where $\xi_1$ and $\xi_2$ are two independent uniform random numbers.
These equations can be quickly checked with the following code, where
R = 2:


> phi <- 2*pi*runif(1000)      # pi is the constant 3.14159... in the R software
> rho <- 2*sqrt(runif(1000))
> x <- rho*cos(phi)
> y <- rho*sin(phi)
> plot(x, y, pch='.', cex=2.5)


It is interesting to notice that an isotropic point distribution in the circle
implies uniformity in $\varphi$ but not in $r$. This can be intuitively justified by looking
at Fig. 8.4b. Uniformity in $r$ would mean having the same number of points
in two circular rings of equal width but different radii, and then a higher density
for the one closer to the centre. Since $\xi \le 1$, the square root operation “pulls” points
closer to the circumference to fulfil the isotropy condition.
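
A simple numerical check of this statement is sketched below in base R (seed, sample size and variable names are our own arbitrary choices). For a uniform density, the fraction of points inside the circle of radius R/2 must equal the ratio of the areas, 1/4; a radius generated uniformly in [0, R] gives 1/2 instead, that is, too many points near the centre.

```r
set.seed(42)                         # arbitrary seed
n <- 100000
R <- 2
phi   <- 2 * pi * runif(n)           # azimuth from Eq. (8.21)
r_ok  <- R * sqrt(runif(n))          # radius from Eq. (8.21)
r_bad <- R * runif(n)                # radius generated uniformly (not isotropic)

# fraction of points inside the circle of radius R/2
frac <- c(sqrt_rule = mean(r_ok < R / 2), uniform_r = mean(r_bad < R / 2))
print(frac)

# with the correct radius, each quadrant should contain 1/4 of the points
x <- r_ok * cos(phi)
y <- r_ok * sin(phi)
quad <- mean(x > 0 & y > 0)
print(quad)
```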


Fig. 8.4 (a) When points are uniformly distributed in the circle, $p(\varphi)\, d\varphi$ is equal to the ratio
between the number of points in the infinitesimal dashed area and the total number of points on
the circular surface. (b) As in the previous figure, but for $q(r)\, dr$


Exercise 8.4 
Generate points isotropically distributed on a spherical surface of radius R 
and uniformly within the spherical volume. 


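Answer The position of a generic point P located on a spherical surface
(that, without losing generality, we assume to be centred at the origin of a system
of Cartesian axes) is usually defined (as in Fig. 8.5a) by the three coordinates
$(r, \varphi, \vartheta)$ with:


$$\begin{aligned}
0 \le \varphi \le 2\pi &\quad (\varphi = \text{azimuthal angle}) \\
0 \le \vartheta \le \pi &\quad (\vartheta = \text{polar angle}) \\
0 \le r \le R &\quad (r = \text{radial distance}).
\end{aligned} \tag{8.22}$$


The formulae giving the transformation of the point coordinates to the
orthogonal system XYZ are:


$$\begin{aligned}
x &= r \sin\vartheta \cos\varphi \\
y &= r \sin\vartheta \sin\varphi \\
z &= r \cos\vartheta .
\end{aligned} \tag{8.23}$$


Now let us consider the infinitesimal spherical volume:

$$dV = r^2\, d\Omega\, dr = r^2 \sin\vartheta\, d\vartheta\, d\varphi\, dr . \tag{8.24}$$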


If we now denote by $n_{tot}$ and $dn$ the total number of points in the sphere
and in $dV$, respectively, for the isotropy condition, we must assume that the
density of points in the sphere is constant and that the probability $p(V)\, dV$ for
any point to be within $dV$ is:


$$p(V)\, dV = \frac{dn}{n_{tot}} = \frac{dV}{V} = \frac{r^2 \sin\vartheta\, d\vartheta\, d\varphi\, dr}{(4/3)\pi R^3}. \tag{8.25}$$


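The probabilities $p(\varphi)\, d\varphi$, $q(\vartheta)\, d\vartheta$ and $p(r)\, dr$ for a point to be inside the
interval $[\varphi, \varphi + d\varphi]$ for any $\vartheta$ and $r$, inside the interval $[\vartheta, \vartheta + d\vartheta]$ for any
$\varphi$ and $r$ and inside the interval $[r, r + dr]$ for any $\vartheta$ and $\varphi$, are evaluated with
the corresponding marginal densities, defined in Eq. (4.11):


$$p(\varphi)\, d\varphi = \frac{1}{V}\, d\varphi \int_0^{\pi}\!\!\int_0^{R} r^2 \sin\vartheta\, d\vartheta\, dr = \frac{1}{2\pi}\, d\varphi, \tag{8.26}$$

$$q(\vartheta)\, d\vartheta = \frac{1}{V} \sin\vartheta\, d\vartheta \int_0^{2\pi}\!\!\int_0^{R} r^2\, d\varphi\, dr = \frac{\sin\vartheta}{2}\, d\vartheta, \tag{8.27}$$

$$p(r)\, dr = \frac{1}{V}\, r^2\, dr \int_0^{\pi}\!\!\int_0^{2\pi} \sin\vartheta\, d\vartheta\, d\varphi = \frac{3 r^2}{R^3}\, dr, \tag{8.28}$$


where $V = (4/3)\pi R^3$. The corresponding cumulative functions are:


$$\xi_1 = P(\varphi) = \frac{\varphi}{2\pi} \tag{8.29}$$

$$\xi_2 = Q(\vartheta) = \frac{1 - \cos\vartheta}{2} \tag{8.30}$$

$$\xi_3 = R(r) = \frac{r^3}{R^3} \tag{8.31}$$


Finally, the formulae for the random generation of $\varphi$, $\vartheta$ and $r$ are respectively
given by:


$$\begin{aligned}
\varphi &= 2\pi\, \xi_1 \\
\vartheta &= \arccos(1 - 2\xi_2) \\
r &= R\, \xi_3^{1/3} .
\end{aligned} \tag{8.32}$$


If all three equations are considered, a uniform distribution within the sphere
is obtained. If we set r = R constant, the first two formulae give the angles
that, inserted in Eqs. (8.23), generate the isotropic distribution
of points on the spherical surface.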



So, remember the following not very intuitive consideration: isotropy on
the spherical surface means uniformity in $\varphi$ but not in $\vartheta$. In fact, from the
previous equations, it is easy to realize that the isotropy condition is satisfied
when $\xi = (1 - \cos\vartheta)$ (or $\xi = \cos\vartheta$), and not $\vartheta$, is a uniformly distributed
variable. As before, this property can be qualitatively understood from a
geometric point of view if we consider the two hatched surfaces $S_1$ and
$S_2$ of Fig. 8.5b, which are respectively included between $[\vartheta_1, \vartheta_1 + d\vartheta]$ and
$[\vartheta_2, \vartheta_2 + d\vartheta]$. A uniform generation in $\vartheta$ would give roughly the same number
of points on the two surfaces, but, since the area of $S_2$ is much larger than that
of $S_1$, the resulting point density would be much greater at the poles than at
the equator. When the generation is uniform inside the sphere volume, the
cubic root transformation for the generation of $r$ acts in the same way as the
square root in the circle case.

These formulae can be applied and verified with our MCsphere routine.
Figure 8.6 shows a result obtained with the generation inside a spherical
volume.
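
Independently of the MCsphere routine, Eqs. (8.32) and (8.23) can be checked with the base R sketch below; the seed, the sample size and the three test quantities are arbitrary choices of ours. For a uniform density in the volume, the fraction of points with $r < R/2$ must be 1/8, the mean of $z$ must vanish, and the fraction of points with $\cos\vartheta > 0.5$ must be 1/4.

```r
set.seed(8)                          # arbitrary seed
n <- 50000
R <- 1
phi   <- 2 * pi * runif(n)           # azimuthal angle
theta <- acos(1 - 2 * runif(n))      # polar angle: cos(theta) is uniform
r     <- R * runif(n)^(1/3)          # radial distance

# Cartesian coordinates from Eqs. (8.23)
x <- r * sin(theta) * cos(phi)
y <- r * sin(theta) * sin(phi)
z <- r * cos(theta)

chk <- c(frac_inner = mean(r < R / 2),          # volume ratio, 1/8
         mean_z     = mean(z),                  # symmetry, 0
         frac_cap   = mean(cos(theta) > 0.5))   # solid angle ratio, 1/4
print(round(chk, 4))
```

Setting r equal to the constant R in the same lines gives points isotropically distributed on the spherical surface.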


Just considering the previous exercises, Algorithm 8.2 would seem to solve all 
random generation problems. In reality, the situation is not so simple, because the 
density p(x) to be integrated could be known only numerically, or the integral 
appearing in the left-hand term of Eq. (8.13) might result in a function that is 
not analytically invertible. In all these cases, there is a wide variety of alternative 
procedures to be used. Most of these methods are already implemented in R, as well 
as in the other statistical software, to generate random numbers from many different 
distributions. 


Fig. 8.5 (a) The generic point P on the spherical surface is identified by the three coordinates
$(R, \vartheta, \varphi)$. (b) To obtain the same point density on the surfaces $S_1$ and $S_2$, a uniform variable
$(1 - \cos\vartheta)$ or $\cos\vartheta$ must be sampled. The uniform sampling of $\vartheta$ would give a different point
density


Fig. 8.6 Uniform simulated point distribution inside a spherical volume from the MCsphere
routine


In the next sections, we will have a detailed look at some of the more commonly 
used random generation techniques. This will enable you to deal with problems 
where particular densities, not included in the most used statistical packages, may 
be involved. 


8.6 Linear Search Method 


When the cumulative F(x) cannot be represented analytically, we can always
numerically compute the integral:


$$F(x) = \int_a^x p(t)\, dt \tag{8.33}$$

and obtain this function at N different points $x_1, x_2, \ldots, x_N$, with $x_1 = a$ and $x_N = b$
(see Fig. 8.7):

$$\begin{aligned}
F(x_1) &= 0, \\
F(x_2) &= F(x_1) + \int_{x_1}^{x_2} p(x)\, dx, \\
F(x_3) &= F(x_2) + \int_{x_2}^{x_3} p(x)\, dx, \\
&\;\;\vdots \\
F(x_N) &= F(x_{N-1}) + \int_{x_{N-1}}^{x_N} p(x)\, dx = 1.
\end{aligned} \tag{8.34}$$

Fig. 8.7 Generation of random variables using the linear search method. The cumulative
function F(x) is evaluated on a set of points to go back to the case of the generation of a
discrete variable


In this way, we have returned to the random generation of a discrete variable,
already described in Sect. 8.4. Indeed, through the relation:


$$F_{j-1} = \sum_{i=1}^{j-1} p(x_i) \le \xi < \sum_{i=1}^{j} p(x_i) = F_j \qquad (j \le N), \tag{8.35}$$


we are immediately able to determine the particular subinterval $[x_{j-1}, x_j]$ containing
the random variable we have to generate.

Assuming F(x) to be linear within each subinterval, we can then obtain the
desired random value with the linear interpolation between the limits $x_{j-1}$ and $x_j$
of the selected subinterval:


$$x = x_{j-1} + \frac{\xi - F_{j-1}}{F_j - F_{j-1}}\, (x_j - x_{j-1}). \tag{8.36}$$


Since the cumulative F(x) is a monotonically increasing regular function, the 
approximation (8.36) generally gives correct results. All these considerations are 
summarized in the following procedure: 


Algorithm 8.3 (Linear Search) To generate a random variable X, distributed as
p(x) and defined in the limited or unlimited interval [a, b], it is necessary:


• To calculate numerically the cumulative function at N points
$x_1, x_2, \ldots, x_N$ ($x_1 = a$; $x_N = b$):


$$F_j = \int_a^{x_j} p(x)\, dx \qquad (j = 1, 2, \ldots, N) \tag{8.37}$$


• To generate a variate $0 \le \xi \le 1$
• To determine the index j such that $F_{j-1} \le \xi < F_j$
• To calculate x through Eq. (8.36)


With this method, the cumulative does not need to be inverted, and numerical 
methods can be used when the integral is difficult or impossible to compute 
analytically. 

However, a large number of points could be needed to precisely reproduce the
F(x) behaviour. Moreover, a non-negligible time may be required by the integral
calculation or by the search required to determine the correct index j.
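
The linear search procedure can be sketched as follows in base R. The function name rlinsearch, the trapezoidal integration and the grid of 1001 points are our own assumptions; as a test density we take $p(x) = x \sin x$ on $[0, \pi/2]$, whose cumulative $\sin x - x\cos x$ cannot be inverted analytically and whose mean is $\pi - 2$.

```r
# Sketch of the linear search method (hypothetical helper)
rlinsearch <- function(n, pdf, a, b, npts = 1001) {
  xg <- seq(a, b, length.out = npts)            # grid x_1 = a, ..., x_N = b
  pg <- pdf(xg)
  # cumulative by the trapezoidal rule, starting from zero
  Fg <- c(0, cumsum(diff(xg) * (pg[-npts] + pg[-1]) / 2))
  Fg <- Fg / Fg[npts]                           # force the last value to one
  xi <- runif(n)
  # j is the index of the LEFT edge: Fg[j] <= xi < Fg[j + 1]
  j <- findInterval(xi, Fg, all.inside = TRUE)
  # linear interpolation inside the selected subinterval, as in Eq. (8.36)
  xg[j] + (xi - Fg[j]) / (Fg[j + 1] - Fg[j]) * (xg[j + 1] - xg[j])
}

set.seed(12)                                    # arbitrary seed
x <- rlinsearch(100000, function(x) x * sin(x), a = 0, b = pi / 2)
print(c(mean = mean(x), expected = pi - 2))
```

Note that in this sketch j labels the lower extreme of the subinterval, whereas in Eq. (8.36) it labels the upper one; the interpolation is the same.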


8.7 Rejection Method 


This method is based on the property of definite integrals of being the area between
the integrand function and the x axis. The procedure starts by randomly choosing
uniformly distributed points within a rectangle (delimited by the vertices A, B,
C, D in Fig. 8.8), with base (b - a) and height h, which encloses the considered
probability density p(x) (obviously h must be greater than or equal to the maximum
$p_{max}$ assumed by p(x) in [a, b]). From Eq. (8.16), the generic coordinates $(x_i, y_i)$
of these points are:


$$x_i = a + \xi_1 (b - a), \qquad y_i = \xi_2\, h, \tag{8.38}$$

with $0 \le \xi_1, \xi_2 \le 1$.


Fig. 8.8 Generation of random variables with the rejection method. The probability density p(x)
is bounded within a rectangle (or within a function f(x)) and a uniformly generated point within
it is “accepted” if it is in the region defined by p(x) and the abscissa axis (hatched area)


Let us now consider the probability for the uniform random variable $a \le X \le b$
to fall within the infinitesimal interval $[x_i, x_i + dx]$:


$$P\{x_i \le X \le x_i + dx\} = P\{x_i \le a + \xi_1 (b - a) \le x_i + dx\} = \frac{dx}{(b - a)}, \tag{8.39}$$

and the conditional probability that, for a given $x_i$, a uniform variable $0 \le Y \le h$ is
less than $p(x_i)$:


$$P\{Y \le p(x_i) \mid x_i \le X \le x_i + dx\} \simeq P\{Y \le p(x_i)\} = P\{\xi_2 \le p(x_i)/h\} = \frac{p(x_i)}{h}. \tag{8.40}$$


According to Theorem 1.21 of compound probabilities, the probability for both the
previous events to occur is:


$$\begin{aligned}
P\{x_i \le X \le x_i + dx,\ Y \le p(x_i)\}
&= P\{Y \le p(x_i) \mid x_i \le X \le x_i + dx\} \cdot P\{x_i \le X \le x_i + dx\} \\
&\simeq \frac{p(x_i)\, dx}{h\,(b - a)} = \varepsilon\, p(x_i)\, dx.
\end{aligned} \tag{8.41}$$

Apart from the constant factor $\varepsilon = 1/[h(b - a)]$, this formula coincides with the
probability for a variable X with density p(x) to be in $(x_i, x_i + dx)$.

These simple considerations are the basis of the rejection technique, which 
applies Eq. (8.41) by trial and error and which we can thus state as: 


Algorithm 8.4 (Simple Rejection Algorithm) To generate a continuous random
variable X, having p.d.f. p(x) defined in the finite interval [a, b] with maximum
value $p_{max}$, it is necessary:


• To uniformly generate a random point $x \in [a, b]$: $x = a + \xi_1 (b - a)$.

• To calculate p(x).

• To uniformly generate a random point y between 0 and h ($h \ge p_{max}$): $y = \xi_2 h$.

• If $y \le p(x)$, then x is accepted; otherwise, it is rejected, and the procedure is
restarted from the beginning.


Clearly, to apply this procedure, it is just necessary to know the analytic expression 
of p(x), while no information on its cumulative is needed; however, the price to pay 
is that at least two uniform random numbers are required to generate a variable from 
the density p(x). 
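
A vectorized sketch of Algorithm 8.4 is given below. The function rreject and its arguments are hypothetical names introduced only for this illustration; the test density $p(x) = 2x$ on [0, 1], with h = 2, is chosen because its mean (2/3) and the expected efficiency $1/[h(b - a)] = 0.5$ are known exactly.

```r
# Sketch of the simple rejection algorithm (hypothetical helper):
# returns n accepted values and the observed fraction of accepted points
rreject <- function(n, pdf, a, b, h) {
  out <- numeric(0)
  ntry <- 0
  nacc <- 0
  while (length(out) < n) {
    m <- max(1000, n)                # number of points tried in this batch
    x <- a + runif(m) * (b - a)      # uniform abscissa in [a, b]
    y <- h * runif(m)                # uniform ordinate in [0, h]
    acc <- x[y <= pdf(x)]            # keep the points under the curve
    ntry <- ntry + m
    nacc <- nacc + length(acc)
    out <- c(out, acc)
  }
  list(x = out[1:n], efficiency = nacc / ntry)
}

set.seed(21)                         # arbitrary seed
res <- rreject(20000, function(x) 2 * x, a = 0, b = 1, h = 2)
print(c(efficiency = res$efficiency, mean = mean(res$x)))
```

Points are generated in batches rather than one at a time only to avoid an explicit loop over the single trials; the acceptance rule is the one of the algorithm.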

The constant $\varepsilon = 1/[h(b - a)]$ is the ratio between the number of accepted
points and the totality of those generated and represents the generation efficiency
or, equivalently, the inverse of the average number of attempts required to accept
a point. It is also equal to the ratio between the area under p(x) and that of the
rectangle ABCD of Fig. 8.8. To optimize the method, it would then be necessary to
generate all points (to be accepted or discarded) no longer within a rectangle but
under a curve f(x), which contains p(x) and mimics its behaviour, in such a way as
to maximize the ratio between the respective areas (see Fig. 8.8). This generalization
makes the rejection method also valid for functions defined in an unlimited range,
as it suffices to find a function f(x) also defined in the same range. To have a
simple (and above all fast) sampling procedure, it is necessary for f(x) to have an
analytically invertible cumulative F(x). If so, using the equations:

$$x_i = F^{-1}(\xi_1), \qquad y_i = \xi_2\, f(x_i), \tag{8.42}$$


a point is sampled under f(x) which, as we have just seen, is accepted if $y_i \le p(x_i)$.

To formally demonstrate this new procedure, Eqs. (8.39) and (8.40) must be
rewritten, which now become, respectively:


$$P\{x_i \le X \le x_i + dx\} \simeq \frac{f(x_i)\, dx}{\int_a^b f(x)\, dx} \tag{8.43}$$

and:

$$P\{Y \le p(x_i) \mid x_i \le X \le x_i + dx\} \simeq P\{\xi_2 f(x_i) \le p(x_i)\} = \frac{p(x_i)}{f(x_i)}. \tag{8.44}$$

After inserting these last two relations in Eq. (8.41), one obtains the result:

$$P\{x_i \le X \le x_i + dx,\ Y \le p(x_i)\} = \frac{p(x_i)\, dx}{\int_a^b f(x)\, dx} = \varepsilon\, p(x_i)\, dx. \tag{8.45}$$


The efficiency $\varepsilon$ of this generalized method is obtained by integrating the previous
relation over all x values:


$$\varepsilon = \frac{\int_a^b p(x)\, dx}{\int_a^b f(x)\, dx} = \frac{1}{\int_a^b f(x)\, dx}. \tag{8.46}$$

In this case, the constant $\varepsilon$, which in Eq. (8.41) was the inverse of the area h(b - a)
of the rectangle ABCD, now is the inverse of the integral of f(x), i.e. the area under
this function.

The previous algorithm can then be generalized as: 


Algorithm 8.5 (Optimized Rejection) To generate a random variable X, distributed
as p(x), which is defined in the (limited or unlimited) interval [a, b] and
bounded by a function f(x) having an analytically invertible cumulative function
F(x), one needs:


• To randomly generate a point $x \in [a, b]$ having density proportional to f(x):
$x = F^{-1}(\xi_1)$.

• To calculate p(x).

• To uniformly generate a random point y between 0 and f(x):
$y = \xi_2 f(x)$.

• If $y \le p(x)$, x is accepted; otherwise, it is rejected, and the procedure restarts
from the beginning.
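
To illustrate Algorithm 8.5 on an unlimited interval, the sketch below generates a half-normal variable, $p(x) = \sqrt{2/\pi}\, e^{-x^2/2}$ for $x \ge 0$, under the exponential envelope $f(x) = \sqrt{2e/\pi}\, e^{-x}$. This example is our own choice, not taken from the routines of the book: one has $p(x)/f(x) = e^{-(x-1)^2/2} \le 1$, the cumulative of the normalized envelope is inverted as $x = -\ln(1 - \xi_1)$, and Eq. (8.46) predicts an efficiency $\sqrt{\pi/(2e)} \simeq 0.76$.

```r
set.seed(17)                                        # arbitrary seed
n <- 100000
p <- function(x) sqrt(2 / pi) * exp(-x^2 / 2)       # half-normal density, x >= 0
f <- function(x) sqrt(2 * exp(1) / pi) * exp(-x)    # envelope with f(x) >= p(x)

x   <- -log(1 - runif(n))     # abscissa with density proportional to f(x)
y   <- runif(n) * f(x)        # ordinate uniform between 0 and f(x)
acc <- y <= p(x)              # acceptance condition
xs  <- x[acc]                 # accepted values, distributed as p(x)

out <- c(efficiency = mean(acc), expected_eff = sqrt(pi / (2 * exp(1))),
         mean = mean(xs), expected_mean = sqrt(2 / pi))
print(round(out, 4))
```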


It is also possible to formulate a third version of the rejection method (devised
by the American mathematician J. von Neumann in the 1950s) when the function
p(x), from which the variable X must be generated, can be factored as the product
of two functions:


$$p(x) = g(x)\, h(x), \tag{8.47}$$


where h(x) has an analytically invertible cumulative H(x) and g(x) is limited
within the interval [a, b]. In this case, we can write:


$$H(x) = \frac{\int_a^x h(t)\, dt}{\int_a^b h(x)\, dx} \tag{8.48}$$

and:

$$0 \le g(x) \le G. \tag{8.49}$$


As usual, we sample two random uniform numbers $\xi_1$ and $\xi_2$ and define the
conditions:


$$x_i = H^{-1}(\xi_1), \qquad \xi_2 \le \frac{g(x_i)}{G}. \tag{8.50}$$


Equations (8.43) and (8.44) now become:


$$P\{x_i \le X \le x_i + dx\} = \frac{h(x_i)\, dx}{\int_a^b h(x)\, dx}, \tag{8.51}$$


$$P\{\xi_2 \le g(x_i)/G \mid x_i \le X \le x_i + dx\} = P\{\xi_2 \le g(x_i)/G\} = g(x_i)/G \tag{8.52}$$


and hence:


$$P\{x_i \le X \le x_i + dx,\ \xi_2 \le g(x_i)/G\} = \frac{h(x_i)\, g(x_i)\, dx}{G \int_a^b h(x)\, dx}. \tag{8.53}$$


Also in this case, the efficiency $\varepsilon$ is obtained by integration over all x values:


$$\varepsilon = \frac{\int_a^b h(x)\, g(x)\, dx}{G \int_a^b h(x)\, dx} \le \frac{G \int_a^b h(x)\, dx}{G \int_a^b h(x)\, dx} = 1. \tag{8.54}$$

If p(x) is normalized, then $\int_a^b h(x)\, g(x)\, dx = 1$, and the efficiency simply results
in:


$$\varepsilon = \frac{1}{G \int_a^b h(x)\, dx} \le 1. \tag{8.55}$$

Equation (8.47) can be rewritten as:

$$p(x) = G \int_a^b h(x)\, dx \cdot \frac{h(x)}{\int_a^b h(x)\, dx} \cdot \frac{g(x)}{G} = C\, h^*(x)\, g^*(x), \tag{8.56}$$


where $h^*(x)$ is a normalized density and $0 \le g^*(x) \le 1$. The constant C, if p(x) is
normalized, is the inverse of the efficiency, and therefore it must satisfy the condition
$C \ge 1$.

The algorithm then becomes: 


Algorithm 8.6 (Weighted Rejection) To generate the values of a random variable
X, distributed as p(x), which is defined in the (limited or unlimited) interval [a, b]
and factored as p(x) = C g(x) h(x), where $C \ge 1$, $0 \le g(x) \le 1$ and where h(x) is
a p.d.f. with analytically invertible cumulative H(x), it is necessary:


• To generate $0 \le \xi_1 \le 1$.

• To randomly generate a point $x \in [a, b]$ from the density h(x): $x = H^{-1}(\xi_1)$.

• To calculate g(x).

• To uniformly generate $0 \le \xi_2 \le 1$.

• If $\xi_2 \le g(x)$, then x is accepted; otherwise, the procedure restarts from the
beginning.


This algorithm can be easily kept in mind if we consider h(x) as the base density
of events and g(x) as a “weight” function: a generated point $x_i = H^{-1}(\xi_1)$ will
be more important (“heavy”) the closer $g(x_i)$ is to one. This condition is taken into
account in the second generation, when the event is accepted only if $\xi_2 \le g(x_i)$. We
finally note that if, in Eq. (8.56), we define the weight function as g(x) = p(x)/h
and h(x) = 1/(b - a), one gets Algorithm 8.4, whereas if the weight function is
g(x) = p(x)/h(x), one gets Algorithm 8.5.


Exercise 8.5
Generate a random variable within the interval $[0, \pi/2]$ with p.d.f.:


$$p(x) = x \sin x. \tag{8.57}$$


Answer In this case we cannot apply the inverse cumulative method
(Algorithm 8.2) because the equation:


$$\int_0^x t \sin t\, dt = \sin x - x \cos x = \xi \tag{8.58}$$


is not analytically invertible for x. We therefore apply the rejection technique
using the three procedures we have just derived.


• Algorithm 8.4
We delimit p(x) within the $[0, \pi/2] \times [0, \pi/2]$ square, as in Fig. 8.9.
Referring to Eq. (8.38), now we have $a = 0$; $b = \pi/2$; $h = \pi/2$.
Therefore, the variable is simulated with an extraction efficiency of about
40% ($\varepsilon = 4/\pi^2$), with the R code:


pig05 = 0.5*pi   # pi is 3.14159... in the R software
# basic rejection method

xv <- seq(0,0,length.out=nsim)   # nsim is the number of points
for(k in 1:nsim) {
   csi = 1
   px = 0
   while(csi > px) {             # repeat until the point is accepted
      x = pig05*runif(1)         # abscissa uniform in [0, pi/2]
      px = x*sin(x)
      csi = pig05*runif(1)       # ordinate uniform in [0, pi/2]
      xv[k] = x
   }
}
• Algorithm 8.5
If we generate points uniformly distributed under the curve f(x) = x (see Fig. 8.9), we double the generation efficiency of the previous algorithm, since the ratio between the areas under p(x) and f(x) is:

$$ \epsilon = \int_0^{\pi/2} p(x)\,dx \Big/ \int_0^{\pi/2} f(x)\,dx = \frac{8}{\pi^2} \simeq 81\% . \tag{8.59} $$

To implement this method, f(x) has to be normalized, calculating its area in the interval [0, π/2]:

$$ \int_0^{\pi/2} x\,dx = \frac{\pi^2}{8} , \tag{8.60} $$

and then the equation:

$$ \int_0^x \frac{8}{\pi^2}\, x\,dx = \xi_1 \tag{8.61} $$

must be solved. The result is:

$$ x = \frac{\pi}{2}\sqrt{\xi_1} . \tag{8.62} $$

In this way the random abscissa $x_i$ has been sampled, whereas the corresponding ordinate is obtained with a random uniform sampling between 0 and f(x) = x, through the equation $y = x\,\xi_2$. Therefore, the variable generation loop is:

    # optimized rejection method
    xv1 <- seq(0,0,length.out=nsim)
    for(k in 1:nsim) {
      csi = 1
      px = 0
      while(csi > px) {
        x = pig05*sqrt(runif(1))
        px = x*sin(x)
        csi = x*runif(1)
        xv1[k] = x
      }
    }


• Algorithm 8.6
Based on Eq. (8.60), the factorization of p(x) is obtained as:

$$ p(x) = \left(\frac{\pi^2}{8}\right)\left(\frac{8}{\pi^2}\,x\right)\sin x , \tag{8.63} $$

and the factors are identified as $C = \pi^2/8$; $g(x) = \sin x$ and $h(x) = (8/\pi^2)\,x$. In this case the loop becomes:

    # weighted rejection method
    xv2 <- seq(0,0,length.out=nsim)
    for(k in 1:nsim) {
      csi = 1
      px = 0
      while(csi > px) {
        x = pig05*sqrt(runif(1))
        px = sin(x)
        csi = runif(1)
        xv2[k] = x
      }
    }


You can find the complete solution of the exercise and the generation of the 
distributions of the values xv, xv1, xv2 in our MCxsinx routine. 
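As a quick cross-check of the efficiencies quoted in this exercise, the acceptance fractions of the three procedures can be estimated directly. The following is a minimal vectorized sketch, not the MCxsinx routine; the variable names (`eff_box`, `eff_tri`, `eff_wgt`) are illustrative.

```r
set.seed(123)
ntry  <- 100000
pig05 <- 0.5 * pi

# Algorithm 8.4: uniform points in the box [0, pi/2] x [0, pi/2]
x <- pig05 * runif(ntry); y <- pig05 * runif(ntry)
eff_box <- mean(y <= x * sin(x))

# Algorithm 8.5: uniform points under the line f(x) = x
x <- pig05 * sqrt(runif(ntry)); y <- x * runif(ntry)
eff_tri <- mean(y <= x * sin(x))

# Algorithm 8.6: x sampled from h(x) = (8/pi^2) x, weight g(x) = sin(x)
x <- pig05 * sqrt(runif(ntry))
eff_wgt <- mean(runif(ntry) <= sin(x))

round(c(eff_box, eff_tri, eff_wgt), 3)
c(4 / pi^2, 8 / pi^2)   # analytic values for comparison
```

The last two procedures accept the same fraction of points, as expected, since they differ only in how the acceptance test is written.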


Fig. 8.9 Behaviour of the functions f(x) = x and p(x) = x sin(x) in the interval [0, π/2]


8.8 Particular Random Generation Methods 


In some cases none of the algorithms discussed so far can generate values of random 
variables in a simple or sufficiently rapid way. 

Fortunately, several well-established algorithms have long been available that efficiently solve many special random generation problems in an “ad hoc” way.

As an example, here we will show some of the methods used to generate Gaussian and Poisson density variables; for a complete review of all (or almost all) the different random generation algorithms, we suggest consulting the references [Fis96, Knu81, Mor84] and [Rub81]. Although these algorithms are already implemented inside the R routines that generate random numbers from all common distributions, we think it is still instructive to take a look at these methods, both to give you additional hints for solving non-standard random generation problems and to review some important probabilistic and statistical concepts.


(a) Gaussian random variates generation.
As we have seen in Sect. 3.5, it is possible to obtain, from a Gaussian density g(x) with any μ and σ, a standard Gaussian value or deviate (with μ = 0 and σ = 1):

$$ t = \frac{x - \mu}{\sigma} , \tag{8.64} $$

coming from the standard Gaussian:

$$ g(t) = \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2} \qquad (-\infty < t < +\infty) . \tag{8.65} $$

If we were able to sample a deviate from this density, with the inverse transformation of Eq. (8.64):

$$ x = t\sigma + \mu , \tag{8.66} $$

we would obtain any other Gaussian variate. However, it is not possible to analytically derive the cumulative of g(t) and, consequently, use Algorithm 8.2.

Given the infinite range of variation of t, Algorithm 8.4 cannot be used either, while Algorithms 8.5 (see Exercise 8.5), 8.6 and 8.3 are all applicable (see, e.g. [TC93]). However, other methods, simpler or more efficient, are usually preferred. One of these, based on the Central Limit Theorem, exploits the properties of the sum of uniform random variables and is described in Problem 8.7. Our MCgauss1 routine can also be examined.

The procedure that is most frequently used for the Gaussian generation is the one devised at the end of the 1950s by the statisticians G.E.P. Box and M.E. Muller, who proved that, contrary to what one might intuitively assume, it is easier to generate not one but two independent Gaussian variables simultaneously. Let us consider the standard bivariate Gaussian in polar coordinates (r, φ) of Eq. (4.81):

$$ g(r,\varphi)\,dr\,d\varphi = \frac{1}{2\pi}\, e^{-r^2/2}\, r\,dr\,d\varphi = p(r)\,q(\varphi)\,dr\,d\varphi , \tag{8.67} $$

with $p(r) = r\,e^{-r^2/2}$ and $q(\varphi) = 1/(2\pi)$, and calculate the cumulative functions of the modulus and angle of the polar vector:

$$ \xi_1 = P(r) = \int_0^r p(r)\,dr = 1 - e^{-r^2/2} , \tag{8.68} $$

$$ \xi_2 = Q(\varphi) = \int_0^{\varphi} q(\varphi)\,d\varphi = \frac{\varphi}{2\pi} . \tag{8.69} $$

Since both p(r) and q(φ) have analytically invertible cumulative functions, by applying Algorithm 8.2, and noting that $1 - \xi_1$ is a uniform variate just like $\xi_1$, one easily gets:

$$ r = \sqrt{-2\log\xi_1} , \qquad \varphi = 2\pi\,\xi_2 . \tag{8.70} $$

Finally, going to the Cartesian coordinates $(z_1, z_2)$:

$$ z_1 = r\cos\varphi = \sqrt{-2\log\xi_1}\,\cos(2\pi\xi_2) , \qquad z_2 = r\sin\varphi = \sqrt{-2\log\xi_1}\,\sin(2\pi\xi_2) , \tag{8.71} $$

the variates of a pair of independent standard Gaussian variables are obtained. Notice that just two random numbers $\xi_1$ and $\xi_2$ have been used.
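The transformation from two uniform numbers to two standard Gaussian variates can be written almost literally in R. The sketch below is only an illustration: the function name `box_muller` is hypothetical and is not one of the routines of this book.

```r
# Trigonometric Box-Muller transformation: each pair of uniform numbers
# (xi1, xi2) gives a pair of independent standard Gaussian variates.
box_muller <- function(n) {
  m   <- ceiling(n / 2)             # number of pairs needed
  xi1 <- runif(m)
  xi2 <- runif(m)
  r   <- sqrt(-2 * log(xi1))        # modulus of the polar vector
  phi <- 2 * pi * xi2               # angle of the polar vector
  z   <- c(r * cos(phi), r * sin(phi))
  z[seq_len(n)]
}

set.seed(42)
z <- box_muller(20000)
c(mean = mean(z), sd = sd(z))       # should be close to 0 and 1
```

A Gaussian variate with generic μ and σ then follows from the linear transformation x = tσ + μ.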


To speed up the algorithm, an ingenious expedient, described in [Knu81], can be used: if we randomly generate a point of Cartesian coordinates $(v_1, v_2)$ inside the unit circle centred on the origin, the sum $s = v_1^2 + v_2^2$ is a random uniform variate between 0 and 1 (we leave the simple proof of this statement as an exercise). We can use this number instead of $\xi_1$, while the angle defined by this point and by the abscissa axis represents the random angle $2\pi\xi_2$. In this way the direct calculation of the trigonometric functions is avoided, since the cosine and the sine appearing in Eqs. (8.71) are calculated through the ratios $v_1/\sqrt{s}$ and $v_2/\sqrt{s}$. This advantage is partially offset by the need to use the rejection technique to obtain the coordinates $(v_1, v_2)$, since we must first generate two numbers $v_1$, $v_2$ uniformly distributed in the interval [−1, 1] and then accept only the pairs satisfying the condition $v_1^2 + v_2^2 \le 1$. However, the efficiency of this operation (≈ 78%, equal to the ratio between the areas of the unit circle and of the square circumscribed about it) is quite high, and on our computer the latter procedure turned out to be about 20% faster than the “classical” method described by Eqs. (8.71). The Gaussian generation algorithm can be summarized as follows:


Algorithm 8.7 (Gaussian Generation) To generate two independent normalized Gaussian variates $Z_1$, $Z_2$ it is necessary:

• To generate two independent uniform variates $0 \le \xi_1, \xi_2 \le 1$.

• To define $v_1 = 2\xi_1 - 1$; $v_2 = 2\xi_2 - 1$ and to calculate the sum $s = v_1^2 + v_2^2$.

• If s > 1, the procedure is repeated from the beginning.

• If s ≤ 1, the events $\{Z_1 = z_1\}$, $\{Z_2 = z_2\}$ are generated as:

$$ z_1 = v_1\sqrt{\frac{-2\log s}{s}} , \qquad z_2 = v_2\sqrt{\frac{-2\log s}{s}} . \tag{8.72} $$

This algorithm is implemented in the MCgauss routine, reported here, which, as a result, gives the histogram of Fig. 8.10 (upon request) and the two independent Gaussian variates g1 and g2.


    MCgauss <- function(nsim=1000,mu=0,sigma=1,plot=TRUE,grid=TRUE) {
      g <- numeric(nsim)            # vector of the generated variates
      index <- seq(1,nsim,by=2)
      for(j in index) {
        s = 2
        while(s>1) {                # rejection: point inside the unit circle
          v1 = 2.*runif(1) - 1
          v2 = 2.*runif(1) - 1
          s = v1^2 + v2^2
        }
        ls = sqrt(-2*log(s)/s)
        z1 = v1*ls
        z2 = v2*ls
        g[j] = mu + sigma*z1
        g[j+1] = mu + sigma*z2
      }
      # plot results
      if(plot==TRUE && nsim>2) {
        # $x.val is the binning of the histogram
        xplot <- HistoBar(g,nbins=30,errors='ON',
                          xex=' ',yex=' ',out=TRUE)$x.val
        dx = (xplot[2]-xplot[1])
        yplot <- nsim*dnorm(xplot,mean=mu,sd=sigma)*dx
        lines(xplot,yplot,type='l')
        if(grid==TRUE) grid()
      }
      rval = list(g1=g[1],g2=g[2])
      return(rval)
    }

Fig. 8.10 Histogram of a sample of 20,000 random numbers from the Gauss2 routine. The continuous curve represents the standard Gaussian


(b) Generation of Poissonian variates.
Each stochastic process in which discrete and independent events are generated with a probability per unit time λ that is constant over time is characterized by two fundamental properties:

• The time interval between two consecutive events is a random variable with exponential density.

• The number of events that occur within a given time interval of a predetermined length Δt is a Poissonian variable of mean μ = λΔt.

In Exercise 3.12, we have seen that the time t between two consecutive stochastic events can be simulated with the equation:

$$ t = -\frac{1}{\lambda}\log\xi . \tag{8.73} $$

To obtain the number x of events occurring within Δt = 1, it is then sufficient to generate a sequence $t_0, t_1, \ldots, t_x, t_{x+1}$ of time intervals (with $t_0 = 0$) until the inequality:

$$ \sum_{i=0}^{x} t_i \le 1 < \sum_{i=0}^{x+1} t_i \tag{8.74} $$

is verified. Taking into account Eq. (8.73), and recalling that μ = λ when Δt = 1 and that $t_0 = 0$ corresponds to $\xi_0 = 1$, this inequality can be rewritten as:

$$ -\sum_{i=0}^{x}\log\xi_i \le \mu < -\sum_{i=0}^{x+1}\log\xi_i , \qquad x = 0, 1, \ldots \tag{8.75} $$

or, without logarithms:

$$ \prod_{i=0}^{x}\xi_i \ge e^{-\mu} > \prod_{i=0}^{x+1}\xi_i , \qquad x = 0, 1, \ldots \tag{8.76} $$

In conclusion, we arrive at the following:

Algorithm 8.8 (Poissonian Generation) To generate a random Poissonian variate, it is necessary:

• To define k = 0; s = 1.

• To generate a uniform variate $0 \le \xi \le 1$.

• To set $s = s\,\xi$.

• If $s < e^{-\mu}$, then set x = k; otherwise, set k = k + 1 and go back to the second step.
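The four steps above translate almost literally into R. The following is a minimal sketch for illustration only; the function name `pois_product` is hypothetical and this is not the listing of the book's own routine.

```r
# Poissonian generation by multiplying uniform numbers until the
# product falls below exp(-mu)
pois_product <- function(mu) {
  k <- 0
  s <- 1
  limit <- exp(-mu)
  repeat {
    s <- s * runif(1)              # s = s * xi
    if (s < limit) return(k)       # x = k
    k <- k + 1
  }
}

set.seed(7)
x <- replicate(20000, pois_product(2.3))
c(mean = mean(x), var = var(x))    # both should be close to mu = 2.3
```

On average the loop consumes μ + 1 uniform numbers per variate, which is why the method becomes slow for large means.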


This algorithm is implemented in our MCpoiss routine. It is easy to deduce that the computation time of this routine increases proportionally to μ; in fact, we have found that the execution times of MCpoiss are much longer than those of the R routine rpois, which, according to the R documentation, relies on the faster algorithm by Ahrens and Dieter.

A useful exercise with the rpois routine is to check Table 6.2 of Sect. 6.6. By setting μ = 2.3 and generating larger and larger samples, you will observe that the fraction of times in which zero events are drawn tends to 10%, that by setting μ = 4.74 the fraction of outcomes with x ≤ 1 tends to 5%, and so on.
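The check suggested here takes only a few lines. This is a sketch with an arbitrary sample size and seed; `ppois` gives the exact cumulative probabilities with which the simulated fractions can be compared.

```r
mu0 <- 2.3     # mean for which P(x = 0) is about 10%
mu1 <- 4.74    # mean for which P(x <= 1) is about 5%

exact <- c(p0 = ppois(0, mu0), p1 = ppois(1, mu1))

set.seed(11)
n <- 200000
sim <- c(p0 = mean(rpois(n, mu0) == 0),
         p1 = mean(rpois(n, mu1) <= 1))

round(rbind(exact, sim), 4)
```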


To simulate random variables having distributions not considered here, it is often sufficient to resort to the definitions and theorems of probability theory. As an example, to simulate a $Q(\nu) \sim \chi^2(\nu)$ variable, just remember Theorem 3.3 and add the squares of ν standard Gaussian variables. By following the standard definitions, correct algorithms are certainly obtained, although not always particularly fast and efficient.
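As an illustration of this remark, a χ² variate can be built directly from its definition. The sketch below uses a hypothetical helper, `rchisq_sum`, and compares the sample mean and variance with the known values ν and 2ν; in practice the R routine rchisq would be preferred.

```r
# chi-square variates with nu degrees of freedom, obtained as the sum
# of the squares of nu standard Gaussian variates
rchisq_sum <- function(n, nu) {
  z <- matrix(rnorm(n * nu), nrow = n, ncol = nu)
  rowSums(z^2)
}

set.seed(3)
q <- rchisq_sum(20000, nu = 5)
c(mean = mean(q), var = var(q))    # to be compared with nu = 5 and 2*nu = 10
```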


8.9 Monte Carlo Analysis of Distributions 


One of the most interesting and astonishing applications of the MC methods is
probably the determination of statistical distributions, however complicated. As
we have seen in Chap. 5, the study of the distribution of random variables that
are functions of other random variables requires rather laborious mathematical 
techniques, so much so that the analytic solution, in most cases, may only be within 
the reach of skilled mathematicians or not even be obtainable in practice. Monte 
Carlo methods have revolutionized this branch of applied statistics, as they provide, 
using elementary procedures, the required solution, even if in the approximate 
form of histograms and not in an appropriate analytical form. From the simulated 
histogram, it is then possible to evaluate all the requested parameters (such as mean, 
variance, area under the tails) with negligible statistical errors in the context of the 
problem under study. 

The procedure is sketched in Fig. 8.11: the random variates of the variables $(X_1, X_2, X_3, \ldots)$ are generated according to their densities, and, at each generation, the variate of the variable $Z = f(X_1, X_2, X_3, \ldots)$ is calculated. After this algorithm has been repeated a sufficient number of times, at the end of the generation loop, the result is usually presented as a histogram of Z, whose binning can be varied at will to be consistent with the problem under study. Finally, the fundamental statistical quantities characterizing the distribution are calculated from this artificial histogram. It is also possible to perform the histogram best fit (with the approaches described later in Sect. 10.7) using empirical functions such as exponentials, polynomials and sums of Gaussians, thus obtaining an analytic form, independent of the histogram binning, that interpolates the true solution with the desired degree of accuracy. The use of the computer has made this method so simple that it can be fully illustrated with an elementary example.

Fig. 8.11 Simulation of a distribution with MC methods

Suppose one wants to determine the distribution of the random variable:

$$ Z = \frac{\sin X}{\sin Y} , \tag{8.77} $$

where X and Y are Gaussian angles having mean and standard deviation equal to:

$$ \theta_x = 20^\circ , \quad \sigma_x = 3^\circ , \qquad \theta_y = 13^\circ , \quad \sigma_y = 3^\circ . \tag{8.78} $$

By using the approximated Eqs. (5.63, 5.66) of Sect. 5.4, the mean and standard deviation of the unknown distribution can be estimated as:

$$ \langle Z \rangle \simeq \frac{\sin\theta_x}{\sin\theta_y} = 1.52 , \tag{8.79} $$

$$ \sigma[Z] \simeq \left[ \left(\frac{\cos\theta_x}{\sin\theta_y}\right)^2 \left(\frac{\pi}{180}\right)^2 \sigma_x^2 + \left(\frac{\sin\theta_x\cos\theta_y}{\sin^2\theta_y}\right)^2 \left(\frac{\pi}{180}\right)^2 \sigma_y^2 \right]^{1/2} = 0.41 , \tag{8.80} $$

where the standard deviations have been converted from degrees into radians, since Z is a pure number. As discussed in Sect. 5.4, both these formulae are approximate: Eq. (8.79) holds only if the relationship Z(X, Y) is linear, while Eq. (8.80) requires both the validity of the linear dependency and small percentage errors.
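The numerical values of these two approximations are easy to verify. The following sketch simply evaluates the linearized formulae; the helper name `sigma_z` is illustrative, and the 0.5° case is the smaller error considered later in this section.

```r
deg <- pi / 180                      # radians per degree
theta_x <- 20 * deg
theta_y <- 13 * deg

# linearized mean of Z = sin(X)/sin(Y)
z_mean <- sin(theta_x) / sin(theta_y)

# linearized standard deviation, for a common angular error given in degrees
sigma_z <- function(sig_deg) {
  s <- sig_deg * deg
  sqrt((cos(theta_x) / sin(theta_y))^2 * s^2 +
       (sin(theta_x) * cos(theta_y) / sin(theta_y)^2)^2 * s^2)
}

round(c(mean = z_mean, sigma_3deg = sigma_z(3), sigma_0.5deg = sigma_z(0.5)), 3)
```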

Since all these conditions are drastically violated by Eqs. (8.77, 8.78), in this case 
we are not at all certain of either the validity of Eqs. (8.79, 8.80) or of the distribution 
of the Z variable. Indeed, our knowledge of probability theory allows us to assume 
a non-Gaussian distribution for Z. Let us see how the simulation is able to easily 
solve all these issues. 

Using the MCrefrac routine given below, we generate 20000 Gaussian
pairs (8.78), which are then combined using Eq. (8.77): the result is the histogram
of Fig. 8.12a.


    MCrefrac <- function(nsim=20000,theta1=20,theta2=13,errtheta=3) {
      rad = 180/pi      # degrees of a 1 radian angle
      # angles in radians
      ang1 = theta1/rad; ang2 = theta2/rad; errang = errtheta/rad;
      g <- numeric(nsim)            # vector of the generated variates
      for(k in 1:nsim) {
        t = 100
        while(t>4.5){   # truncation at t=4.5 to avoid a long tail
                        # of negligible values
          X = rnorm(1,mean=ang1,sd=errang)
          Y = rnorm(1,mean=ang2,sd=errang)
          t = sin(X)/sin(Y)
        }
        g[k] = t
      }
      HistoBar(g,errors='ON',nbins=20,grid=TRUE)
    }

Fig. 8.12 Histogram of 20,000 variates of Z from Eq. (8.77), where X and Y are Gaussian variables given by (a) Eqs. (8.78); (b) Eq. (8.81)


As can be easily noticed, the parameters μ and σ deduced from the histogram:

⟨Z⟩ ≃ 1.61, σ[Z] ≃ 0.49,

have values rather different from the predictions of Eqs. (8.79, 8.80). Their statistical error, according to the fundamental formulae of Table 6.3, is of the order of 0.002–0.003 and can therefore be neglected in this context. In any case, it can be reduced at will simply by increasing the number of simulated variables. This simulation also shows that the shape of the density deviates significantly from the normal curve. For instance, it can be easily checked that the number of events in the interval:

μ ± σ ≃ m ± s = 1.61 ± 0.49 = [1.12, 2.10]

is equal to 14 895, corresponding to a percentage of about 75%.
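These figures can be reproduced without the plotting routine. The sketch below is a vectorized variant written for illustration: it discards the values above 4.5 instead of regenerating them, and the additional z > 0 guard is a precaution of this sketch, not part of the routine above.

```r
set.seed(2024)
nsim <- 20000
deg  <- pi / 180

x <- rnorm(nsim, mean = 20 * deg, sd = 3 * deg)
y <- rnorm(nsim, mean = 13 * deg, sd = 3 * deg)
z <- sin(x) / sin(y)
z <- z[z > 0 & z <= 4.5]           # drop the far tail of negligible values

m <- mean(z)                       # sample mean
s <- sd(z)                         # sample standard deviation
frac <- mean(z >= m - s & z <= m + s)   # fraction of events within m +/- s

round(c(m = m, s = s, frac = frac), 3)
```

For a Gaussian density the last number would be close to 0.683; a visibly larger value signals the asymmetric tail.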

It is then clear that, apart from the lack of knowledge of the exact analytical form 
of the histogram density of Fig. 8.12a, all other information is easily accessible with 
the simulation. It is also interesting to decrease the dispersion of X and Y and see 
what happens. Intuition tells us that we should eliminate the long tail of density, 
which is due to the non-linearity of the sine function. We then operate as before, but 
replacing the conditions of Eq. (8.78) with: 


$$ \theta_x = 20^\circ , \quad \sigma_x = 0.5^\circ , \qquad \theta_y = 13^\circ , \quad \sigma_y = 0.5^\circ . \tag{8.81} $$

Equation (8.79) remains unchanged, whereas Eq. (8.80) in this case gives the value:

σ[Z] = 0.068 .

The simulation result is shown in Fig. 8.12b. Now the density is “almost” normal, with mean and standard deviation very close to the approximations of Eqs. (8.79, 8.80). Within the interval m ± s, 13 723 events are found, corresponding to a percentage of 68.6%, a value in perfect agreement with the 3σ law.


8.10 Evaluation of Confidence Intervals 


MC techniques also play a fundamental role in the determination of confidence intervals, because they allow us to solve the integrals (6.7) in an approximate way even when it is difficult or impossible to find the functional form of the density $p_Z(z; \theta)$, where $Z = f(X_1, X_2, \ldots)$ is a random variable that is a function of other primary variables of known distributions. The parameters θ are usually related to the distributions of the variables $X_1, X_2, \ldots$.

Assuming we need to estimate a parameter θ ∈ Θ, the method follows Definition 6.1: if $z_{\rm meas}$ is the measured value (often an estimator of θ obtained from a sample of size n), for each value of θ, with the methods described in the previous paragraph, a histogram representative of the density $p_Z(z; \theta)$ must be obtained. The conditions defined by the integrals (6.7) are approximated by counting, for each generated histogram, the fraction f (frequency) of events for which $z < z_{\rm meas}$. The extremes of the confidence interval $[\theta_1, \theta_2]$ are found when:

$$ 1 - f \simeq \int_{z_{\rm meas}}^{+\infty} p_Z(z; \theta_1)\,dz = c_1 , \qquad f \simeq \int_{-\infty}^{z_{\rm meas}} p_Z(z; \theta_2)\,dz = c_2 . \tag{8.82} $$

If the parameter is continuous, a discrete grid is considered, choosing a step Δθ small enough to meet the required precision.

In Eq. (8.82), the equality is replaced by the symbol ≃, which means values within the statistical error. If the required precision is defined in terms of standard deviation, from Eq. (6.27), it follows that a number n of observations must be simulated until the errors:

$$ \sqrt{\frac{c_1(1 - c_1)}{n}} , \qquad \sqrt{\frac{c_2(1 - c_2)}{n}} \tag{8.83} $$

assume the requested value.
Let us suppose, for instance, that one needs to determine a standard symmetric confidence interval with CL = 0.683; in this case $(1 - CL)/2 = c_1 = c_2 = 0.158$. If we generate a histogram with 100 000 events, about 15 800 events will fall under each tail, and, from Eq. (8.83), the statistical error of f would be about 0.001.

When the density $p_Z(z; \theta)$ depends on a set of parameters θ ∈ Θ, one builds a multidimensional grid for all parameters and records the sets of values for which, as in Eqs. (8.82), the chosen confidence levels are satisfied.

The method based on Eqs. (8.82), known as the grid method, is fully general but often rather cumbersome. However, when the shape of the Z density is invariant with respect to the parameters θ, it is possible to proceed in a much easier and more direct way.

Let us first recall the results obtained in Sect. 6.2. If Eq. (6.10) holds, the second of Eqs. (8.82) becomes:

$$ c_2 = \int_{-\infty}^{z_{\rm meas}} p(z; \theta_2)\,dz = F(z_{\rm meas}; \theta_2) = 1 - F(\theta_2; z_{\rm meas}) . $$

This equation, when F is invertible, can be solved with respect to $\theta_2$:

$$ \theta_2 = F^{-1}(1 - c_2; z_{\rm meas}) . \tag{8.84} $$

Analogously, for $\theta_1$ one obtains:

$$ \theta_1 = F^{-1}(c_1; z_{\rm meas}) . \tag{8.85} $$

In other words, we need to evaluate the quantiles of order $1 - c_2$ and $c_1$ of the F distribution with parameter $z_{\rm meas}$. When F is not analytically invertible, $\theta_1$ and $\theta_2$ can be easily obtained by simulating a histogram of this distribution.

Equations (6.9, 6.10) hold in the Gaussian case. Therefore, if Z has mean θ (as in the case of an estimator of the mean), Eq. (6.10) becomes:

$$ F(z_{\rm meas} - \theta) = 1 - F(\theta - z_{\rm meas}) , $$

where F is the cumulative function of a zero mean Gaussian, which is symmetric around zero. If CL = 1 − α and $c_1 = c_2 = \alpha/2$, Eqs. (8.84) and (8.85) give:

$$ \theta_1 - z_{\rm meas} = F^{-1}(\alpha/2) , \qquad \theta_2 - z_{\rm meas} = F^{-1}(1 - \alpha/2) , $$

that is:

$$ \theta_1 = z_{\rm meas} + t_{\alpha/2} , \qquad \theta_2 = z_{\rm meas} + t_{1-\alpha/2} $$

(see Fig. 8.13, with $\mu = \theta_1$ and $x = z_{\rm meas}$). Basically, to solve the problem with the MC technique, we focus on the measured value and find the location of $\theta_2$ from the density tail, simulating a sample with a Gaussian density $g(z; z_{\rm meas})$ and evaluating $\theta_2$ through the histogram. We proceed in a similar way for $\theta_1$.
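In the Gaussian case this shortcut can be checked against the exact quantiles. The sketch below uses purely illustrative numbers (`z_meas`, `sigma` and the sample size are assumptions, not values from the text): a sample is simulated from the Gaussian centred on the measured value and the interval limits are read from its empirical quantiles.

```r
set.seed(5)
z_meas <- 10                       # illustrative measured value
sigma  <- 2                        # illustrative (known) standard deviation
alpha  <- 1 - 0.683                # CL = 0.683

# sample from the Gaussian density having the measured value as parameter
zsim <- rnorm(100000, mean = z_meas, sd = sigma)

# limits from the simulated sample and from the exact Gaussian quantiles
boot  <- quantile(zsim, c(alpha / 2, 1 - alpha / 2), names = FALSE)
exact <- z_meas + qnorm(c(alpha / 2, 1 - alpha / 2), sd = sigma)

round(rbind(boot, exact), 3)
```

The two rows agree within the statistical error of the simulated quantiles, which decreases as the sample size grows.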


Fig. 8.13 Determination of the confidence interval, when the property (6.10) holds, that is in the case of symmetry and shape invariance of the density with respect to the parameters under study (here μ). With the grid method (upper curve), the value of μ is found when the shaded area matches the assigned CL (usually 1 − CL or (1 − CL)/2). With the bootstrap method (lower curve), a histogram from the density which has as parameter the measured value x is obtained. Then, the confidence interval of μ is found as if it were a probability interval. The shaded area under the tail is the same


This procedure (see [Buc84, DH99, Rip86]), which replaces the true parameters with the estimated ones, is part of a class of MC algorithms known as “bootstrap”, which we will discuss in more detail in Sect. 8.12.

Let us now apply these general principles to the concrete case of Eq. (8.77). If Eqs. (8.78) refer to a measurement, the true quantities $\theta_x$ and $\theta_y$ must be replaced with the experimental values $t_x$ and $t_y$. We will suppose that the quantity $z \equiv n_{\rm meas} = \sin t_x / \sin t_y = 1.52$ is the value of the refractive index n obtained by measuring the directions of the light rays with an optical goniometer that has a resolution of ±3° or of ±0.5°. The result, as is usually the case for laboratory measurements, must be given at a confidence level of 68.3% centred on the measured value $n_{\rm meas} = 1.52$. Let us consider, to begin with, the case of large errors (σ = 3°). To apply the grid method, since the measured angles are independent Gaussian variables, it is necessary to calculate the distribution of sin X / sin Y, where X and Y are Gaussian variables with means $\theta_x$ and $\theta_y$ and standard deviation equal to the measurement error of 3°, and to determine the two refractive indices for which the measured value n = 1.52 is respectively the quantile of order α and 1 − α, with α = 0.1585. The calculation must be performed for all possible values of $\theta_x$ and $\theta_y$. The result, obtained with the MCgrid routine and also shown in Fig. 8.14, gives the interval:

$$ n \in [1.10, 1.95] = 1.52^{+0.43}_{-0.42} , \qquad CL \simeq 68.3\% . \tag{8.86} $$

Fig. 8.14 Simulated samples from the density $p(n; \theta_x, \theta_y)$ for the values of $\theta_x$ and $\theta_y$ giving the upper and lower limits of the interval (8.86). The vertical line represents the measured value (horizontal axis: refraction index n)


Now let us apply the bootstrap method: in Eqs. (8.84, 8.85), the values $\theta_x$ and $\theta_y$ of Eq. (8.78) are replaced by the measured angles $t_x$ and $t_y$, and the histogram of Z = sin X / sin Y is generated with X and Y sampled from the Gaussian densities $g(x; t_x, \sigma_x)$ and $g(y; t_y, \sigma_y)$. This histogram is nothing but the one already displayed in Fig. 8.12a, which should be considered as sampled from a population of density $p(n; t_x, t_y)$. Using 100 000 simulated events, with the Boot grid routine, 15 850 events have been obtained inside each of the intervals (−∞, 1.17] and [2.03, +∞):

$$ n \in [1.17, 2.03] = 1.52^{+0.51}_{-0.35} , \qquad CL \simeq 68.3\% . \tag{8.87} $$

This result is different from the correct one given by Eq. (8.86). The reason for the discrepancy is the strong asymmetry of the distribution, as seen from Fig. 8.14.
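The bootstrap limits can be reproduced by reading the quantiles of order α and 1 − α directly from a simulated sample. This is a minimal sketch written for illustration, not the routine mentioned in the text; it applies no truncation of the far tail, which has a negligible effect on these quantiles.

```r
set.seed(99)
nsim  <- 100000
deg   <- pi / 180
alpha <- 0.1585                    # tail area for CL = 68.3%

# angles sampled around the measured values 20 and 13 degrees, error 3 degrees
x <- rnorm(nsim, mean = 20 * deg, sd = 3 * deg)
y <- rnorm(nsim, mean = 13 * deg, sd = 3 * deg)
n_sim <- sin(x) / sin(y)

# quantiles of order alpha and 1 - alpha of the simulated refractive index
lims <- quantile(n_sim, c(alpha, 1 - alpha), names = FALSE)
round(lims, 2)
```

The limits are not symmetric around 1.52, which reflects the asymmetry of the simulated density.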

In the situation shown in Fig. 8.12b, corresponding to an error of ±0.5°, the measurement density assumes an invariant form within the considered angular range and the confidence interval becomes symmetrical around the measured value. In this case, both the methods previously described, grid and bootstrap, give the same result:

n ∈ [1.45, 1.59] = 1.52 ± 0.07, CL ≃ 68.3% ,

where the uncertainty of the interval extremes, which is of a few parts per thousand, is neglected. The simulation shows that, in order to obtain a reliable measurement, it is advisable to have measurement errors of the order of half a degree. This is the typical accuracy of ordinary optical goniometers. In Sect. 12.9 we will apply simulation techniques to the extremely important case of the propagation of measurement errors.


8.11 Simulation of Counting Experiments 


In Sects. 6.6 and 6.8, we have introduced the confidence intervals for the estimation 
of probabilities and frequencies. Here we want to explore, using simulation, the 
statistical coverage properties of these intervals. Over the past 10 years, these results 
have changed the approximate formulae to be used in counting experiments that are 
presented in many statistics books. 

Given a measured frequency f = x/n, obtained from x successes in n trials, 
the interval containing the true value of the probability with a confidence level CL 
can be evaluated with Eqs. (6.32, 6.18, 6.19, 6.37) that we report again here for 
convenience: 


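• The Clopper-Pearson general formula for the binomial case, which is the frequentist estimate $p \in [p_1, p_2]$, where $CL = 1 - c_1 - c_2$ and $p_1$, $p_2$ are the solutions of the equations:

$$ \sum_{k=x}^{n} \binom{n}{k}\, p_1^k (1 - p_1)^{n-k} = c_1 , \tag{6.18} $$

$$ \sum_{k=0}^{x} \binom{n}{k}\, p_2^k (1 - p_2)^{n-k} = c_2 . \tag{6.19} $$

• The Gaussian approximation with continuity correction, where $f_\pm = (x \pm 0.5)/n$, gives the Wilson interval:

$$ p \in \frac{ f_\pm + \dfrac{t_{\alpha/2}^2}{2n} \pm |t_{\alpha/2}| \sqrt{ \dfrac{t_{\alpha/2}^2}{4n^2} + \dfrac{f_\pm (1 - f_\pm)}{n} } }{ 1 + \dfrac{t_{\alpha/2}^2}{n} } . \tag{6.37} $$

• The large sample approximation, which gives the well-known Wald interval, presented in any elementary statistics handbook:

$$ p \in f \pm |t_{\alpha/2}| \sqrt{\frac{f(1 - f)}{n}} . \tag{6.32} $$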

Given a number of attempts and a probability p, the simulation code MCbinocov 
(given below) randomly extracts a number of successes x with the rbinom routine, 
calculates the confidence intervals with the formulae just considered and counts 
how many times there is a success, that is, when the true value p is included in the 
interval. 
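Before looking at the coverage, it may help to compute the three intervals for a single outcome. The numbers below (x = 12 successes in n = 30 trials, CL = 0.90) are purely illustrative, and the Wilson formula is written with the same continuity correction $f_\pm = (x \pm 0.5)/n$ used in this section.

```r
x <- 12; n <- 30; conf <- 0.90     # illustrative values
t  <- qnorm(0.5 * (1 + conf))      # Gaussian quantile
f  <- x / n
fm <- max(0, x - 0.5) / n          # f_- (lower limit)
fp <- min(1, (x + 0.5) / n)        # f_+ (upper limit)

# Clopper-Pearson limits from the R routine binom.test
clopper <- as.numeric(binom.test(x, n, conf.level = conf)$conf.int)

# Wilson interval with continuity correction
wilson <- c((fm + t^2 / (2 * n) - t * sqrt(fm * (1 - fm) / n + t^2 / (4 * n^2))) / (1 + t^2 / n),
            (fp + t^2 / (2 * n) + t * sqrt(fp * (1 - fp) / n + t^2 / (4 * n^2))) / (1 + t^2 / n))

# Wald interval
wald <- f + c(-1, 1) * t * sqrt(f * (1 - f) / n)

round(rbind(clopper, wilson, wald), 3)
```

For this outcome the Wald interval is the narrowest of the three, while the Wilson limits stay close to the Clopper-Pearson ones.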


Fig. 8.15 Coverage curves (coverage versus probability) for the binomial distribution, with n = 30 and CL = 0.90, for the frequentist (Clopper-Pearson) confidence intervals (6.18, 6.19) (right box), for the Wilson interval (6.37) (full curve in the left box) and for the approximated (Wald) interval (6.32) (dashed curve in the left box). The Clopper-Pearson and the Wilson formulae provide equivalent coverages


The coverage curve is obtained by repeating the procedure 10,000 times for each value of p and plotting, with the plot routine, the fraction of inclusions. The values $p_1$ and $p_2$ of the frequentist formulae (6.18, 6.19) are obtained by inversion of the cumulative of the binomial distribution from the R routine binom.test. The numerical methods performing this inversion are also described in [ZeaPDG20, PFTW92]. For the limiting cases x = 0 and x = n, Eqs. (6.22, 6.23) are used.

The coverage curves are shown in Fig. 8.15 for a binomial distribution with n = 30 for intervals with CL = 0.90, which correspond to a Gaussian quantile $|t_{\alpha/2}| = t_{1-\alpha/2} = 1.645$, where α = 1 − CL. The structure of the curves appears irregular due to the discrete value of the variable examined, namely, the number x of successes.

The result, quite surprising, shows how the approximate formula (6.32) (dashed 
curve in the left box of Fig. 8.15) provides an absolutely unsatisfactory coverage, 
almost always well below the assigned confidence level. Contrary to usual practice, 
this formula should therefore only be used for quite large samples, at least with 
n > 300 [Rot10]. The frequentist interval of Eqs. (6.18, 6.19), shown in the right 
box, despite the irregular behaviour of the curve, provides a correct over-coverage, 
with values always above the chosen confidence level. As noticed before, this over- 
coverage is simply due to the presence of the x value in both sums (6.18, 6.19). 

The simulated results also clearly show the difference between confidence level and coverage in the case of discrete variables. Another interesting result is that the Wilson interval with the continuity correction given by Eq. (6.37), which can be easily calculated, provides a coverage equivalent to that of the correct frequentist formulae, which, on the contrary, require the use of statistical software or ad hoc programs to be calculated. In conclusion, the simulation shows that the use of Eq. (6.37) should be much more widespread than it is now, because this formula, unlike Eq. (6.32), provides reliable confidence intervals already for n > 10 [BCD01, Rot10]. The routine used for the previous tests is given below. We suggest that you try it with different input parameters for a comprehensive check of all these approaches.


    # MCbinocov(nsim,npts,N,conf,grid,scale,wald,wilson,clopper):
    #   check of the coverage of the
    #   Clopper-Pearson, Wilson and Wald formulae
    # INPUT:
    #   nsim = number of simulated events
    #   npts = number of points of the plots
    #   N = number of trials of the binomial
    #   conf = [0,1] confidence level
    #   grid = when TRUE a grid is made on the plots
    #   scale = scale*conf is the lower limit of the final plots
    #   wald, wilson, clopper = when TRUE the plots are drawn
    # OUTPUT: plots of the coverages

    MCbinocov <- function(nsim=1000,npts=100,N=20,conf=0.68,grid=TRUE,
                          scale=0.8,wald=TRUE,wilson=TRUE,clopper=TRUE) {
      t = qnorm(0.5*(1+conf))
      p <- seq(0.0,1.,length.out=npts)
      cov <- seq(0.,0.,length.out=npts)
      waldcov <- seq(0.,0.,length.out=npts)
      wilscov <- seq(0.,0.,length.out=npts)
      for(j in 1:npts) {       # points of the plots
        for(k in 1:nsim){      # event simulated at each point
          x = rbinom(1,size=N,prob=p[j])
          f = x/N
          fp = min(1,(x+0.5)/N)
          fm = max(0,(x-0.5))/N
          b1 = binom.test(x,N,conf.level=conf)$conf.int[1]  # Clopper
          b2 = binom.test(x,N,conf.level=conf)$conf.int[2]
          w1 = f - t*sqrt(f*(1-f)/N)   # Wald
          w2 = f + t*sqrt(f*(1-f)/N)
          ww1 = (fm+(t^2/(2*N))-
                 t*sqrt(fm*(1-fm)/N + t^2/(4*N^2)))/(1+t^2/N)
          ww2 = (fp+(t^2/(2*N))+
                 t*sqrt(fp*(1-fp)/N + t^2/(4*N^2)))/(1+t^2/N)
          ww1 = max(0.,ww1)    # Wilson
          ww2 = min(1.,ww2)
          if(b1 <= p[j] && p[j] <= b2) cov[j]=cov[j]+1
          if(w1 <= p[j] && p[j] <= w2) waldcov[j]=waldcov[j]+1
          if(ww1 <= p[j] && p[j] <= ww2) wilscov[j]=wilscov[j]+1
        }
      }
      cov <- cov/nsim          # coverages
      waldcov <- waldcov/nsim
      wilscov <- wilscov/nsim
      if(clopper==TRUE) {      # plots
        title = paste(" Clopper  ---- (red) Wald  --- (blue) Wilson")
        plot(p,cov,type='l',lwd=2,ylim=c(scale*conf,1),main=title)
        if(wald==TRUE) lines(p,waldcov,lty=2,lwd=2,col='red')
        if(wilson==TRUE) lines(p,wilscov,lty=2,lwd=2,col='blue')
      }
      else if(clopper==FALSE && wilson==TRUE) {
        title = paste(" ----(red) Wald     Wilson")
        plot(p,wilscov,type='l',lwd=2,ylim=c(scale*conf,1),main=title)
        if(wald==TRUE) lines(p,waldcov,lty=2,lwd=2,col='red')
      }
      else{
        title = paste(" Wald ")
        plot(p,waldcov,type='l',lwd=2,ylim=c(scale*conf,1),main=title)
      }
      if(grid==TRUE) grid()
      abline(h=conf,col='red')
    }



If we need to estimate a counting frequency μ starting from a measured count x, the equations to use are (6.41, 6.44, 6.45), which, for convenience, we write again here:

• The general formula for the Poisson case is the frequentist estimate of the interval $\mu \in [\mu_1, \mu_2]$, where the values $\mu_1$, $\mu_2$ are the solutions of the equations:

$$ \sum_{k=x}^{\infty} \frac{\mu_1^k}{k!}\exp(-\mu_1) = c_1 , \qquad \sum_{k=0}^{x} \frac{\mu_2^k}{k!}\exp(-\mu_2) = c_2 . \tag{6.41} $$

• Using the Gaussian approximation with continuity correction ($x_\pm = x \pm 0.5$), one obtains:

$$ \mu \in x_\pm + \frac{t_{\alpha/2}^2}{2} \pm |t_{\alpha/2}| \sqrt{x_\pm + \frac{t_{\alpha/2}^2}{4}} . \tag{6.44} $$

• When x is large, Eq. (6.44) can be replaced by:

$$ \mu \in x \pm |t_{\alpha/2}| \sqrt{x} . \tag{6.45} $$

The coverage curves for these three intervals, for μ < 30, are shown in Fig. 8.16.
These plots have been obtained with our MCpoisscov routine, not reported here,
since its structure is similar to that of MCbinocov, detailed just above.

As in the binomial case, we also notice here that the coverage given by the 
approximate formula (6.45) is absolutely unsatisfactory. The frequentist formulae 
and the one under Gaussian approximation with continuity correction are practically 
equivalent and give a good over-coverage (full curves in Fig. 8.16). Equation (6.44) 
gives a result more appropriate than the approximate formula (6.45), which should 
only be used when the average exceeds a few dozen events [Rot10]. 
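
Since MCpoisscov is not listed, the following is only a hedged sketch of how such a coverage study could look; it is not the authors' routine. The function poiss_coverage, the grid of means and the number of simulated counts are our assumptions. For each true mean μ we generate Poisson counts and record how often the interval x ± t sqrt(x) and the continuity-corrected Gaussian interval contain μ.

```r
# Hypothetical sketch of a Poisson coverage study (not the authors' MCpoisscov)
set.seed(1)                                        # fixed seed, for reproducibility only
poiss_coverage <- function(mu, CL = 0.90, nsim = 20000) {
  tq <- qnorm(1 - (1 - CL) / 2)
  x <- rpois(nsim, mu)                             # simulated counts
  # approximate interval x -/+ t*sqrt(x)
  simple <- mean(x - tq * sqrt(x) <= mu & mu <= x + tq * sqrt(x))
  # Gaussian interval with continuity correction (x -/+ 0.5)
  xm <- pmax(x - 0.5, 0)
  xp <- x + 0.5
  lo <- xm + tq^2 / 2 - tq * sqrt(xm + tq^2 / 4)
  hi <- xp + tq^2 / 2 + tq * sqrt(xp + tq^2 / 4)
  corrected <- mean(lo <= mu & mu <= hi)
  c(simple = simple, corrected = corrected)
}

mu_grid <- c(1, 2, 5, 10, 20, 30)
cover <- t(sapply(mu_grid, poiss_coverage))
print(cbind(mu = mu_grid, cover))
```

The fraction of simulated counts whose interval contains the true mean is the Monte Carlo estimate of the coverage, to be compared with the nominal CL.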




[Figure 8.16: two panels (left and right boxes); horizontal axes: event mean number, from 0 to 30]


Fig. 8.16 Coverage curves, for the Poisson density with CL = 0.90, for the frequentist confidence 
intervals (6.41, 6.19) (right box), for the interval of Eq. (6.44) (full curve in the left box) and for the 
approximated interval (6.45) (dashed curve in the left box). The frequentist formulae and Eq. (6.44) 
give equivalent coverages 


8.12 Non-parametric Bootstrap 


In the previous sections, we have seen how to numerically derive the properties of 
a stochastic variable through random samples obtained with parametric probability 
density models. Often, if no information is available, the estimated values instead 
of the true ones are assigned to the model parameters (means, variances or other). 
This method, known as parametric bootstrap, can be further generalized to those 
problems where there is no information on the probability density of the random 
variable to be examined [DH99, DE83, Efr79, Efr82, ET93, PFTW92]. To fully 
understand this non-parametric bootstrap technique, let us generate a sample of 100 
standard Gaussian variables and calculate their mean: 


> x <- rnorm(100)
> mean(x)
[1] 0.01365425


We now proceed to the bootstrapping of the sample, generating N new samples
of the same size as the original one (n = 100). We sample, with replacement,
the elements of the original set of values. In R this operation can be performed
by many routines, sample among others, through the call
sample(x, size=n, replace=TRUE). This routine is also often used to permute
the elements of a vector, with the call sample(x, size=length(x)), since the
replace parameter is set by default to FALSE. It should be immediately noticed
that the bootstrap samples thus generated differ from the original one due to
the presence of repeated elements. Actually, this apparently trivial fact is
the core of the method.

We now proceed by generating N = 1000 bootstrap samples, loading their means
in a vector boot, with the inline statements:


> boot <- seq(0,0,length.out=1000)
> for(j in 1:1000) boot[j] = mean(sample(x,size=100,replace=TRUE))


Now let us calculate the mean and standard deviation of the sample of the means: 


> mean(boot)
[1] 0.01502075
> sqrt(var(boot))
[1] 0.1005019

The first result coincides, within the statistical error, with the mean mean(x)
of the initial set of values, while the second one represents the surprise: it
corresponds to the statistical error of the sample mean given by Eq. (6.50), as can
be easily checked with the R command sd(x)/sqrt(100). In other words, without
applying any statistical theory, the standard deviation of the means of the bootstrap
samples gives the standard deviation of the mean of x. This example indicates two
essential aspects of the non-parametric bootstrap:


1. The mean values of the bootstrap samples usually do not give new information, 
because they are distributed around the initial experimental values. However, 
significant differences, larger than the statistical error, can sometimes occur 
between the original and the bootstrap mean. In this case, the difference is called 
bias and can be used to correct the confidence interval, as will be discussed later. 

2. The dispersion of the bootstrap samples is an estimate of the dispersion of the 
parameter under consideration. Thus, bootstrap is very useful, for example, when 
studying the variances of some complicated quantities. 
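
The two aspects listed above can be checked with a short, self-contained sketch. The seed, the number of replications nboot and the variable names are our own choices, made only for illustration; the standard error sd(x)/sqrt(n) is used as the reference value.

```r
set.seed(123)                                  # fixed seed, only to make the sketch reproducible
x <- rnorm(100)                                # "experimental" sample
n <- length(x)
nboot <- 2000                                  # number of bootstrap replications (assumed value)
# Means of the bootstrap samples, drawn with replacement from x
boot <- replicate(nboot, mean(sample(x, size = n, replace = TRUE)))
se_boot   <- sd(boot)                          # dispersion of the bootstrap means
se_theory <- sd(x) / sqrt(n)                   # standard error of the sample mean
bias      <- mean(boot) - mean(x)              # difference between bootstrap and original mean
print(c(se_boot = se_boot, se_theory = se_theory, bias = bias))
```

With these settings the bootstrap dispersion and the standard error should agree within a few percent, while the difference between the two means should be much smaller than the statistical error.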


Let us now check the reliability of the method in the more difficult case of the
variance. We generate a sample of n = 100 Gaussian variates with $\mu = 0$ and
$\sigma = 2$:

> y <- rnorm(100,sd=2)
> var(y)
[1] 4.761325


As previously mentioned, both in this and in the other similar cases, if you repeat 
the exercise, you would get slightly different values due to the statistical fluctuations 
present in the simulated data. Let us now generate a bootstrap sample of variances: 


> bootv <- seq(0,0,length.out=1000) 
> for(j in 1:1000) bootv[j] = var(sample(y,size=100,replace=TRUE) ) 


From this sample we find the two quantile values $q_{0.158}$, $q_{0.841}$, corresponding to
a confidence interval with CL = 0.683 (equal to $1\sigma$ in the Gaussian case) using the
R routine quantile:


> mean(bootv)
[1] 4.719306
> quantile(bootv,c(0.158,0.841),names=FALSE)
[1] 4.1585526 5.259678

Again one has mean(bootv) ≃ var(y) within the statistical error, while the
quantile values are very close to those of the exact formula (6.76):


> 99*var(y)/qchisq(0.841,df=99)
[1] 4.171343
> 99*var(y)/qchisq(0.158,df=99)
[1] 5.54933

Finally, let us go back to Exercise 6.10, in which we determined a confidence
interval for the true correlation coefficient of the chest/height data pairs given in 
Table 6.4, under the assumption that the bivariate probability density p(x, y) of 
the data pairs was given by the two-dimensional Gaussian function. Let us now 
try to determine a confidence interval for the chest/height correlation coefficient 
without assuming a Gaussian density for these two variables and, therefore, not 
using Fisher’s transformation. 

With the bootstrap method, it is assumed that the obtained sample represents
the true probability density of the data from which random samples are generated
to estimate the parameters to be determined. Then, correlated pairs $(s_i, t_i)$ of
chest/height values are extracted (with replacement) from the experimental sample
until a new “virtual” sample of 1665 elements is obtained. Since only the histogram
data of Table 6.4 are available, to obtain bootstrap samples, it is first necessary to
generate an approximate sample of raw data, which maintains the original data set
structure. This is done in our BootCor routine by duplicating the chest and height
data a number of times equal to the content of each bin of the two-dimensional
frequency histogram. For example, pairs of 88 cm chest and 166 cm height will be
repeated 114 times and so on. From this sample of 1665 “original” data, bootstrap
samples are then created to estimate the correlation coefficient $r^*$, obtained
through Eq. (6.117) with the same operations of Exercise 6.10.

By repeating this operation a sufficiently large number of times, we obtain
a fairly large sample of coefficients $r^*$, which allows the determination of the
confidence interval for the correlation coefficient.

The histogram obtained with 10000 different simulated values of $r^*$ with our
BootCor routine is shown in Fig. 8.17. From these data the following confidence
interval is obtained:


$$r^* \in [0.221, 0.309] = 0.266^{+0.043}_{-0.045} \quad (CL \simeq 95.4\%),$$


which is exactly the same as that found in Exercise 6.10!
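
The procedure just described can be sketched as follows. This is not the BootCor routine: the 3 x 3 frequency table, the bin centres and all the variable names are invented for illustration and do not reproduce Table 6.4. The sketch rebuilds an approximate raw sample from the bin contents, resamples the pairs with replacement and takes the quantiles of the bootstrap correlation coefficients.

```r
set.seed(7)                                    # fixed seed, for reproducibility only
# Hypothetical 3 x 3 frequency table (NOT the data of Table 6.4)
chest  <- c(84, 88, 92)                        # bin centres, cm
height <- c(162, 166, 170)                     # bin centres, cm
freq <- matrix(c(30, 20,  5,
                 20, 40, 20,
                  5, 20, 30), nrow = 3, byrow = TRUE)
# Approximate raw sample: each pair is repeated as many times as its bin content
cells  <- expand.grid(i = 1:3, j = 1:3)
counts <- freq[cbind(cells$i, cells$j)]
s <- rep(chest[cells$i],  times = counts)
h <- rep(height[cells$j], times = counts)
n <- length(s)
# Bootstrap: the same resampled index is used for both variables,
# so that the correlation inside each pair is preserved
rstar <- replicate(5000, {
  idx <- sample.int(n, n, replace = TRUE)
  cor(s[idx], h[idx])
})
ci <- quantile(rstar, c(0.023, 0.977), names = FALSE)   # CL of about 95.4%
print(c(r = cor(s, h), lower = ci[1], upper = ci[2]))
```

Resampling the row index, rather than the two columns separately, is the essential design choice: shuffling the columns independently would destroy the correlation that we want to estimate.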

After these examples, it is time to ask ourselves when and why bootstrap works.
The generation of fictitious or artificial samples starting from the real sample (and
having the same size n) is equivalent to replacing the unknown probability density
$p(x)$ with a discrete probability distribution $p^*(x)$ with n components and to
assigning the same probability 1/n to each observed value. From the artificial sample
$(x_1^*, \ldots, x_n^*)$ obtained from this distribution, a summary value $t^*$ is obtained. By
repeating this operation r times, a sample of values $t_1^*, t_2^*, \ldots, t_r^*$ is collected, which
allows us to estimate the properties of the statistic $T = t(X_1, \ldots, X_n)$, which encloses
the information about the parameter $\theta$.

Fig. 8.17 Histogram of $r^*$ obtained with 10,000 bootstrap samples

Consider now the cumulative function $F^*(x)$ of $p^*(x)$, which assumes the values
(0, 1/n, 2/n, ..., n/n). In a formal way, it can be written as:

$$F^*(x) = \frac{\#(x_i \le x)}{n} , \tag{8.88}$$


where the symbol # denotes the number of times the condition in brackets is verified.
Using Eq. (8.88), it is easy to recognize that $nF^*(x)$ follows a binomial distribution with
$p = F(x)$, where F is the cumulative of the true, but unknown, density $p(x)$. From
these considerations, it follows that:

$$E[nF^*(x)] = nF(x) = np , \qquad \mathrm{Var}[F^*(x)] = \frac{1}{n}\{F(x)[1 - F(x)]\} = \frac{p(1-p)}{n} . \tag{8.89}$$


Remembering the Central Limit Theorem, it is easy to derive that, when n
increases, $F^*(x)$ tends both to be Gaussian distributed and to better and better
approximate the true cumulative $F(x)$. The correctness of the bootstrap method
relies on this simple property, which at first glance may seem almost miraculous!¹


¹ Indeed, due to its outrageous simplicity, this method, after the well-known works of Efron [Efr79,
Efr82], took some time to be adopted by statisticians.
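
The binomial behaviour of the empirical cumulative can be verified numerically. In the sketch below, which is ours and not part of the book's routines, the sample size n, the evaluation point x0 and the number of repetitions are arbitrary choices; the samples are standard Gaussian, so that the true cumulative is pnorm(x0). Since the number of observations not exceeding x0 is binomial, the empirical cumulative at x0 should have mean p and variance p(1 - p)/n.

```r
set.seed(11)                                   # fixed seed, for reproducibility only
n  <- 50                                       # sample size (arbitrary choice)
x0 <- 0.5                                      # point where the cumulative is evaluated
p  <- pnorm(x0)                                # true cumulative F(x0) of a standard Gaussian
# Empirical cumulative #(x_i <= x0)/n, recomputed on many independent samples
Fstar <- replicate(20000, mean(rnorm(n) <= x0))
print(c(mean_Fstar = mean(Fstar), p = p,
        var_Fstar = var(Fstar), binomial_var = p * (1 - p) / n))
```

Increasing n in this sketch shrinks the variance as 1/n, which is the property on which the method relies.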




As we have just seen, in non-parametric bootstrap $F(x)$ is replaced by $F^*(x)$.
The error introduced by this type of approximation is due both to the difference
$(\hat{t} - \theta)$ between the true value $\theta$ of the parameter and its correct statistical
estimate $\hat{t}$, and to the difference $(\hat{t} - t^*)$, where $t^*$ is evaluated from the simulated
bootstrap sample. For the variance estimation, the method is based on the validity of
the condition:

$$t^* - \langle t^* \rangle \simeq \hat{t} - \theta , \tag{8.90}$$

that is, on the approximate similarity between the bootstrap data dispersion around
their mean and the estimator dispersion around the true value. The method also
allows us to verify a possible discrepancy between the average of the bootstrap data
and the estimated value $\hat{t}$, determined by the difference, called bias:

$$\hat{t} - \langle t^* \rangle , \tag{8.91}$$

which may be due both to the finite size of the experimental and bootstrap samples
(n and r, respectively) and to the particular estimator T used. For example, if T is
not a linear function, it is usually not true that, asymptotically, $\langle t^* \rangle = \hat{t}$, and this leads
to a systematic bias in the bootstrap estimate.
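
A simple case in which the bias can be computed by hand helps to fix the idea. The sketch below is our own illustration: it uses, as a non-linear estimator, the variance with divisor n (the helper plug_var is hypothetical), for which the average over the bootstrap samples is known to be (n - 1)/n times the value obtained from the original sample. The difference between the estimate and the bootstrap average should therefore be close to the estimate divided by n.

```r
set.seed(5)                                    # fixed seed, for reproducibility only
n <- 20
x <- rnorm(n, sd = 2)                          # small "experimental" sample
plug_var <- function(v) mean((v - mean(v))^2)  # non-linear estimator, divisor n
that  <- plug_var(x)                           # estimate from the original sample
tstar <- replicate(20000, plug_var(sample(x, n, replace = TRUE)))
bias  <- that - mean(tstar)                    # estimate minus bootstrap average
print(c(that = that, mean_tstar = mean(tstar),
        bias = bias, expected = that / n))
```

Here the bias is a property of the estimator and of the sample size n: increasing the number of replications only reduces the statistical noise with which it is measured.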

The size n of the bootstrap samples is usually set equal to that of the initial 
experimental sample, while the number r of bootstrap samples (called replications),
from which the parameter variance is estimated, is generally gradually increased 
until the statistical error becomes negligible for the problem under study, and the 
solution appears stable. Usually a number of replications 300 < r < 1000 is 
adequate for this purpose. The systematic error or bias can be corrected (see, for 
instance, [DH99]) but is independent of r. 

To ensure the validity of Eq. (8.90), it would be necessary to find pivotal estimators
(that have the same distribution both with respect to $F^*(x)$ and to $F(x)$), but
this is not always feasible. A list of the various methods that are successfully applied
in these cases can be found in [DH99]. In conclusion, whenever the statistical
properties of a certain parameter to be estimated from a sample are not known,
the confidence interval can be estimated with the bootstrap method. However, keep
in mind that you cannot know in advance which are the “bad” bootstrap samples, and
this ultimately remains the main limitation of the method. These applicability issues
are discussed in detail in [DH99]. To check whether the bootstrap method is
applicable to your specific case, we suggest testing the procedure on simulated
data, verifying how compatible the results are with the true values,
which are known in this case. An example of this procedure can be found in
our CovarTest routine, where we compare the variances of the covariance of
simulated data calculated using bootstrap and Eq. (6.116).
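
A check of this kind can be sketched as follows; this is not the CovarTest routine, and the numbers of simulated experiments and of replications are our own choices, kept small to limit the running time. We simulate many Gaussian samples with known variance, build for each of them the bootstrap quantile interval of the variance and count how often the true value falls inside it.

```r
set.seed(2)                                    # fixed seed, for reproducibility only
true_var <- 4                                  # known variance of the simulated data
n <- 100
covered <- replicate(200, {                    # 200 simulated "experiments"
  y <- rnorm(n, sd = sqrt(true_var))
  bootv <- replicate(500, var(sample(y, n, replace = TRUE)))
  q <- quantile(bootv, c(0.158, 0.841), names = FALSE)
  q[1] <= true_var && true_var <= q[2]         # does the interval contain the true value?
})
coverage <- mean(covered)
print(coverage)                                # to be compared with CL = 0.683
```

A coverage far from the nominal level would signal that, for the estimator and sample size at hand, the bootstrap interval should not be trusted without corrections.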




8.13 Hypothesis Test with Simulated Data 


As we have seen, for large samples and for Gaussian samples of any size, there
are general methods for both estimation and hypothesis testing. On the contrary,
the field of small non-Gaussian samples, for which it is not possible to formulate a
general theory, still remains open. It is then not surprising that simulation techniques
are generally used to solve this type of problem, even in the case of hypothesis
testing.

As an example, let us consider the t-test on means or pairs of values carried out
with the permutation test. Suppose we want to check the compatibility of the means
$m_x$ and $m_y$ of two samples $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_m)$
generated from an unknown parent population. We construct a vector
$z = (x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_m)$ of dimension (n + m), simply obtained by joining
the two initial vectors. Under the hypothesis $H_0$ of equality of the true means,
the sample z represents a homogeneous sample coming from populations with a
common mean. At this point, the unknown distribution of the difference between
the means is estimated with the following steps:


(a) Permute the vector z.

(b) Calculate the means of both the first n elements ($z_i$, $i = 1, \ldots, n$) and of the last
m elements ($z_i$, $i = n+1, \ldots, n+m$) of the permuted vector z. The difference
between these means is calculated, and its (absolute) value is recorded.

(c) Repeat operations (a) and (b) R times to obtain a sample of the differences
$d^* = |m_x^* - m_y^*|$. This sample is assumed as the difference sample drawn from
the population under $H_0$.

(d) The p-value of the difference $d = |m_x - m_y|$ is estimated from the difference
sample generated in (c) with the formula:

$$p\text{-value} = \frac{\#(d^* > d)}{R} , \tag{8.92}$$

where, as before, # is the number of times the condition in brackets is verified.


This method is generally classified among the non-parametric bootstrap techniques, 
because the reference population is estimated using real data. Since the test 
is usually two-tailed, the absolute value of the differences in the algorithm is 
considered. 

We also note that in the permutation test the pure difference has been considered,
without dividing it by the total standard deviation, because the method assumes that
the two original samples have comparable variances, so that they can be exchanged
without affecting the results. Small differences can be tolerated by the method when
the data are mixed during the permutations.

Many R routines perform the permutation test; one of them
is twot.permutation, from the library DAAG. We wrote the routine
BootPermTest, whose core is the following:




n1 = length(vec1)
n2 = length(vec2)
texp = abs(mean(vec1)-mean(vec2)) # experimental value
if (tmedian==TRUE) texp = abs(median(vec1)-median(vec2))
pool <- c(vec1,vec2) # global pooled vector under H_0
pdiff <- numeric(Nperm) # differences obtained from the permutations
for(j in 1:Nperm) {
  permut <- sample(pool) # permutation of the pooled vector
  if (tmedian == TRUE) {
    pdiff[j] = abs(median(permut[1:n1])-median(permut[(n1+1):(n1+n2)]))
  } else {
    pdiff[j] = abs(mean(permut[1:n1])-mean(permut[(n1+1):(n1+n2)]))
  }
}
# p-value: sum the TRUE cases over the total
pval = sum(pdiff > texp)/Nperm


The routine receives in input the two raw data vectors vec1 and vec2 and
also offers the possibility to evaluate the difference between the medians, through
the tmedian parameter. This possibility, due to the flexibility of the simulation
methods, is often useful in the case of long-tailed distributions, since in this case
the median is a more stable (robust) parameter than the mean. We also note that, to
obtain correct results, it is important to calculate the two means from the same
permutation. If you try to compare this routine with the t-test of t.test, you will see
that the results are practically the same for large samples and for Gaussian or
quasi-Gaussian samples. However, if we sample from a negative exponential distribution
with vec1 <- rexp(5,rate=1) and vec2 <- rexp(5,rate=7), we see
that the Student’s test of t.test fails because it gives p-values around 15%, while
BootPermTest gives p-values around 4% for both the mean and the median, a
clear hint of a possible systematic difference between samples.
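
For readers who wish to repeat this comparison without the complete routine, the following is a self-contained sketch written along the same lines. The function perm_test and its stat argument are hypothetical names of ours, not the interface of BootPermTest. The p-values will fluctuate from one pair of samples to another, since the samples are very small.

```r
set.seed(3)                                    # fixed seed, for reproducibility only
# Hypothetical stand-alone permutation test on the difference of a statistic
perm_test <- function(vec1, vec2, Nperm = 5000, stat = mean) {
  n1 <- length(vec1)
  n2 <- length(vec2)
  texp <- abs(stat(vec1) - stat(vec2))         # observed difference
  pool <- c(vec1, vec2)                        # pooled vector under H_0
  pdiff <- replicate(Nperm, {
    permut <- sample(pool)                     # one permutation, used for both groups
    abs(stat(permut[1:n1]) - stat(permut[(n1 + 1):(n1 + n2)]))
  })
  mean(pdiff > texp)                           # fraction of cases beyond the observed one
}

vec1 <- rexp(5, rate = 1)
vec2 <- rexp(5, rate = 7)
print(c(perm_mean   = perm_test(vec1, vec2),
        perm_median = perm_test(vec1, vec2, stat = median),
        t_test      = t.test(vec1, vec2)$p.value))
```

Passing the statistic as a function argument is one possible way to switch between means and medians; a logical flag, as in the routine described above, is equally valid.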

Sometimes, in hypothesis testing, simulated artificial data are needed instead of 
real ones. We explain this case by reconsidering Exercise 7.4, where we have to deal 
with small and binomial-distributed samples that are quite far from the Gaussian 
case. We can solve the problem by inserting the data into two-valued vectors x (sick 
rats) and n (all rats) and generating N results from two binomials having the same 
pooled probability fhat of Eq. (7.9). These steps are coded in the R instructions: 


fhat = sum(x)/sum(n) # pooled frequency
b1 <- rbinom(N,n[1],fhat)
b2 <- rbinom(N,n[2],fhat)
# p-value: fraction of simulated differences beyond the observed one
if (x[1]>x[2]) pval = sum(b1-b2>=x[1]-x[2])/N
if (x[1]<=x[2]) pval = sum(b1-b2<=x[1]-x[2])/N
which are part of our MCDiffProp routine. As can be seen, this routine
counts the events where the data generated from the binomials have a difference
at least as extreme as the observed difference (x[1]-x[2]). The p-values obtained
with this method, for example, with the call MCDiffProp(x=c(4,19),
n=c(817,1631)) (see the table in Exercise 7.4), are around 7% for both the
considered tumours, a value significantly higher than those obtained before with the
Gaussian approximation in Exercise 7.4.
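
The same steps can be collected in a small stand-alone function. What follows is only a sketch built around the instructions shown above: the name mc_diff_prop, the default number N of simulated pairs and the use of mean() to turn the count into a fraction are our assumptions, not the actual code of MCDiffProp.

```r
set.seed(4)                                    # fixed seed, for reproducibility only
# Hypothetical stand-alone version of the test on two binomial counts
mc_diff_prop <- function(x, n, N = 100000) {
  fhat <- sum(x) / sum(n)                      # pooled frequency under H_0
  b1 <- rbinom(N, n[1], fhat)                  # simulated counts, first sample
  b2 <- rbinom(N, n[2], fhat)                  # simulated counts, second sample
  d_obs <- x[1] - x[2]                         # observed difference of the counts
  # one-sided p-value, in the direction of the observed difference
  if (d_obs > 0) mean(b1 - b2 >= d_obs) else mean(b1 - b2 <= d_obs)
}

pval <- mc_diff_prop(x = c(4, 19), n = c(817, 1631))
print(pval)
```

With the counts of the call quoted above, the result should be of the order of a few percent, in line with the value reported in the text.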

We think that the previous examples clearly explain this technique and will 
enable you to apply it in other similar cases. 


8.14 Problems 


8.1 Simulate the behaviour of a player who bets on lottery numbers that have been 
delayed for more than 90 weeks. Does this strategy increase the chance of winning? 


8.2 Solve Monty Problem 1.1 with a simulation code. 
8.3 Solve the encounter problem 1.7 with a simulation code. 


8.4 Write a simulation code that calculates the probability that the distance between 
two randomly drawn points uniformly within a circle is less than the radius. 


8.5 Generate, with Algorithm 8.4, random variates x following the so-called
half-normal probability density:

$$p(x) = \sqrt{\frac{2}{\pi}}\, e^{-x^2/2} , \qquad \forall x \ge 0 ,$$

by using the function $f(x) = k e^{-x}$. Find the k value that gives the highest
generation efficiency. With the value of x thus obtained, generate a variable Z ~
N(0, 1).


8.6 Determine the confidence levels of Eq. (6.31) for n = 5, p = 0.25 and t =
1, 2, 3 with a simulation code.


8.7 Write an algorithm for the generation of Gaussian deviates using the Central 
Limit Theorem. 


8.8 Generate a pair of standard normal variables X and Y having linear correlation 
coefficient p. 


8.9 How can the rejection algorithm be used to randomly generate points uniformly 
distributed inside a circle of radius R? 


8.10 Generate the histogram of the variable Z = Y/X where X and Y are standard 
Gaussian variables. 




8.11 The height of a homogeneous population is a Gaussian variable with
$\langle X \rangle \pm \sigma[X] = 175 \pm 8$ cm for men and $\langle Y \rangle \pm \sigma[Y] = 165 \pm 6$ cm for women.
Knowing that there is a positive correlation with $\rho = 0.5$ between the height of
husband and wife (couples tend to have a similar height), find, by using a simulation
code, the percentage of couples with husband and wife taller than 180 and 170 cm,
respectively (see also Problems 4.6, 4.7).


8.12 Solve Problem 5.9 with a simulation code. 


8.13 Parallel lines are drawn on an infinite plane at a unitary distance. A unit length
needle is randomly thrown on the plane. It can be shown that the probability that the
needle falls across a line is $p = 2/\pi$ (Buffon’s needle problem [Gne76]). Estimate
the value of $\pi$ with a Monte Carlo code.


8.14 In the right-left (or top-bottom) asymmetry problem, events can occur “to the
left” with probability P or “to the right” with probability 1 − P. Simulate 5000
left-right experiments and count, over N = 50 events, the number of times with $n_s$
events on the left and $n_d = N - n_s$ events on the right. Make the histogram of the
asymmetry $A = (n_s - n_d)/N$ and compare the standard deviation of the data with
the true one evaluated in Problem 12.8.


8.15 Using the MCbinocov routine with CL = 95%, find the smallest value of 
n such that the approximate Eq. (6.32) gives a difference between coverage and 
CL < 5%. 


8.16 Using the MCpoisscov routine with CL = 95%, find the smallest value
of $\mu$ such that the approximate Eq. (6.45) gives a difference between coverage and
CL < 2%.


8.17 The average prices (in $) of the shares (s) and the bonds (b) of the New 
York Stock Exchange in the years 1950-1959 are shown in the following table 
from [Spi61]: 


     1950   1951   1952   1953   1954   1955   1956   1957   1958   1959
s    35.2   39.9   41.9   43.2   40.1   53.3   54.1   49.1   40.7   55.2
b   102.4  100.9   97.4   97.8   98.3  100.1   97.1   91.6   94.85  94.65


An economic theory predicts a negative correlation between stock and bond prices. 
Use the non-parametric bootstrap method to test the model at the 5% level. 


8.18 Using the bootstrap method, find the confidence interval with CL = 90%, for 
the odds ratio of Problem 6.16. 


8.19 Write an MC code for the calculation of Tukey quantiles (7.66). 


Chapter 9 
Applications of Monte Carlo Methods


THUMB’S FIRST POSTULATE 


It is better to solve a problem with a crude approximation and 
know the truth, plus or minus 10 percent, than to demand an 
exact solution and not know the truth at all. 


Arthur Bloch, “MURPHY’S LAW BOOK Two”. 


9.1 Introduction 


The fields of application of the Monte Carlo (MC) methods are practically unlimited,
and it is really difficult to deal organically with such a wide variety within the
limited space of a chapter.

However, we think it is useful to work out the detailed solution of a few general 
problems, starting from the overall framework up to the extreme detail of the 
simulation code. In this way, you will acquire a complete mastery of the procedures, 
so that you will be able to successfully implement and adapt these methods to your 
specific problems. 

The examples presented in this chapter show the variety of contexts where MC 
methods can be applied: the process of particle diffusion in matter (a typical problem 
of experimental physics, chemistry or engineering), the calculation of the optimal 
number of workers in a plant (an instructive application to industrial management), 
some applications of the Metropolis algorithm (study of systems with a large 
number of identical components, which are of interest in economics, engineering, 
physics and chemistry) and, finally, the calculation of the value of definite integrals 
(here we are in the fields of theoretical physics and mathematics). 


9.2 Study of Diffusion Phenomena 


The propagation of particles in a given material, such as neutrons in a nuclear
reactor, electrons in a metal or in a semiconductor and photons in the atmosphere,
is usually referred to as diffusion. In this type of process, each particle, starting
from an initial state, follows a predetermined trajectory during a certain time period
up to a random instant in which it undergoes an impact, i.e. an interaction, with the
traversed material, with a possible production, in specific situations, of secondary
particles. As a result, the particle changes, always randomly, direction and velocity
modulus and continues its path until the next collision. The general scheme is that
of Fig. 9.1.


© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
A. Rotondi et al., Probability, Statistics and Simulation, La Matematica per il 3+2
139, https://doi.org/10.1007/978-3-031-09429-3_9

From a theoretical point of view, this system can be described by the Boltzmann 
(or transport) equation, whose derivation is given in many advanced physics, 
chemistry or engineering texts, such as [Lam66]. 

Unfortunately, as J. Lamarsh correctly commented in the above reference, “it 
is much easier to derive the Boltzmann equation than to solve it”, since it has a 
complex integro-differential form, where, due to the complicated configuration of 
real systems, there are rapidly varying parameters depending both on space and 
particle energy. 

MC methods are thus almost always the only available approach to these 
problems, and, despite the great variety of diffusion processes, it is also possible 
to outline a simulation scheme of almost universal validity which, for example, is 
used by the simulation codes widely employed in particle and nuclear physics, such as
GEANT [A+03] and MCNP [W+18].

Let us then consider, as a useful practical example, the case of neutron diffusion
within a material. Here, as for the $\gamma$ rays and for the other electrically neutral
particles, the calculation of the trajectories is very simple, since neutrons follow
straight line paths between two successive collisions with the atomic nuclei of the
traversed medium.

Let us consider, as in Fig. 9.1, a point source isotropically emitting neutrons
with kinetic energy of 0.0038 eV,¹ located at the centre of a homogeneous sphere
of infinite radius and composed of the diffusing material. The MC procedure that
we present can be divided into various steps, which are common to all diffusive
calculations (see also [RvN63]):

(a) Geometry, materials and interactions. Since the medium in which the neutrons
propagate is infinite and homogeneous, our system is very simplified, and it
is therefore not necessary to introduce any geometric information. Instead, we
need to know which parameters define the neutron interactions with atomic
nuclei. As for the other particles, these processes are described by a quantity,
called “total macroscopic cross section” (or, more briefly, $\Sigma_T$), which gives the
probability, per unit path, of having any type of interaction. $\Sigma_T$ is an intrinsic
property of any material, is constant in homogeneous media and depends on
the particle kinetic energy but not on the previous particle path. In the case of
a non-monoenergetic source, it is therefore necessary to know, with sufficient
precision, the behaviour of $\Sigma_T$ in the considered energy interval. Suppose, as
it is reasonable to do for thermal neutrons, that the possible interactions are of
two types: absorption or elastic scattering (change of direction without loss of
energy); with this hypothesis the total macroscopic cross section is obviously
decomposed as:

$$\Sigma_T = \Sigma_a + \Sigma_{el} , \tag{9.1}$$

where $\Sigma_a$ and $\Sigma_{el}$ are the macroscopic cross sections of the considered
processes. Since the neutrons we are considering are monoenergetic, we need
only two numbers, read as initial parameters, to quantitatively define their
interactions. Other data that will be needed (and which will always be read
at the beginning of the programme) are the mass number of the target nucleus
(necessary for the calculation of the trajectories) and the speed of the neutrons,
as we are also interested in determining the time needed for their absorption.


¹ Electron-volt, or eV, is the energy unit widely used in atomic and nuclear physics. By definition,
it is the kinetic energy acquired by an electron in passing through two points where a potential
difference of 1 Volt exists. It is easy to show that 1 eV = $1.602 \cdot 10^{-12}$ erg. Neutrons of 0.0038 eV are
in thermal equilibrium with matter at the room temperature of 27°C.


Fig. 9.1 Schematic representation of the trajectory followed by a thermal neutron during the
diffusion process (labels in the figure: neutron trajectory, elastic collision, flight distance,
absorption)


(b) Kinematics. The neutron emission point is fixed and, for convenience, is set at
the centre of the coordinate system. The flight direction of the neutrons exiting
the source is determined, on the basis of Eq. (8.32), by randomly generating the
angles $\varphi$ and $\vartheta$ through the relations:

$$\varphi = 2\pi \xi_1 , \qquad \vartheta = \arccos(1 - 2\xi_2) . \tag{9.2}$$

Then, the direction cosines are calculated:

$$\alpha = \sin\vartheta \cos\varphi , \qquad \beta = \sin\vartheta \sin\varphi , \qquad \gamma = \cos\vartheta , \tag{9.3}$$

which, with the knowledge of the starting point, allow us to unambiguously
define the initial trajectory parameters of the particle.

(c) Tracking. Based on the scheme of Fig. 9.1, a tracking step coincides with the
flight distance, since the neutron does not have any other interactions between 
two successive collisions. This parameter can be calculated by noting that, 
according to the very definition of macroscopic cross section, the neutron 
interaction in any of the possible ways is a stochastic process having the same 
characteristics as those we have described in Sect. 3.7. In fact, the neutron,
during its path in the traversed material, undergoes collisions (i.e. “generates”
events) that are uncorrelated, discrete and with a constant probability $\Sigma_T$;
then, similarly to Eq. (3.48), the probability of having a distance x between two
successive collisions is:

$$p(x)\,dx = \Sigma_T\, e^{-\Sigma_T x}\, dx . \tag{9.4}$$

From this distribution (whose mean $1/\Sigma_T$ is the mean free path between two
successive collisions), and using Eq. (3.91) rewritten as:

$$d = -\frac{1}{\Sigma_T} \ln \xi_3 , \tag{9.5}$$

it is possible to associate a distance d to each neutron, equal to the length of
the simulation “step” between the starting point and the one where the next
collision occurs.

Since we know both the emission angles and the covered distance, we are
able to calculate the coordinates of the interaction point between the neutron
and a nucleus of the diffusing material. Next, we need to evaluate the effects
of this interaction on the trajectory of the particle, knowing that the probability
of being absorbed is given by $\Sigma_a/\Sigma_T$, while that of being diffused is $\Sigma_{el}/\Sigma_T$.
The choice between these two alternatives is made by drawing a new random
number $\xi_4$: if $0 \le \xi_4 \le \Sigma_a/\Sigma_T$, the neutron is absorbed, and the current
event ends; if, instead, $\Sigma_a/\Sigma_T < \xi_4 \le 1$, an elastic scattering occurs and a
new flight direction is generated, as shown in Fig. 9.1.


For thermal neutrons, the collision process is isotropic with respect to
the direction of incidence in the neutron-nucleus centre of mass system.² The
azimuthal ($\varphi_{cm}$) and polar ($\vartheta_{cm}$) emission angles must then be generated in this
system:

$$\cos\vartheta_{cm} = 1 - 2\xi_5 , \tag{9.6}$$

$$\varphi_{cm} = 2\pi \xi_6 , \tag{9.7}$$

and transformed into the laboratory system [Lam66]:

$$\varphi = \varphi_{cm} , \qquad \cos\vartheta = \frac{1 + A\cos\vartheta_{cm}}{\sqrt{A^2 + 2A\cos\vartheta_{cm} + 1}} , \tag{9.8}$$

where A is the mass number of the target nucleus. Afterwards, the direction
cosines of the new flight direction ($\alpha'$, $\beta'$, $\gamma'$) of the neutron are calculated as:

$$\alpha' = \mu\alpha + a(\alpha\gamma \sin\varphi + \beta \cos\varphi)$$
$$\beta' = \mu\beta + a(\beta\gamma \sin\varphi - \alpha \cos\varphi) \tag{9.9}$$
$$\gamma' = \mu\gamma - a(1 - \gamma^2) \sin\varphi$$

where:

$$a = \sqrt{\frac{1 - \mu^2}{1 - \gamma^2}} ; \qquad \mu = \cos\vartheta ; \qquad |\gamma| \ne 1 . \tag{9.10}$$

If $|\gamma| = 1$, one instead has:

$$\alpha' = \gamma b \cos\varphi , \qquad \beta' = b \sin\varphi , \qquad \gamma' = \gamma\mu , \tag{9.11}$$

with $b = \sqrt{1 - \mu^2}$. The geometry of the collision process in the laboratory
system is shown in Fig. 9.2. The demonstration of these transformations is in
our web pages [RPP].


² Consider a set of N particles of masses $m_1, m_2, \ldots, m_N$ having velocities $v_1, v_2, \ldots, v_N$
respectively, in a certain reference coordinate system. The point defined with the equation:

$$r_{cm} = \frac{\sum_{i=1}^{N} m_i r_i}{\sum_{i=1}^{N} m_i}$$

(where $r_i$ is the position of the i-th particle with respect to the origin of the coordinate system)
is called centre of mass. Similarly, its velocity can be defined as:

$$V_{cm} = \frac{\sum_{i=1}^{N} m_i v_i}{\sum_{i=1}^{N} m_i}$$

A reference system where $V_{cm} = 0$ is called centre of mass system of the particle set. It allows, as
in our case, to describe many nuclear reactions in a simple way.


Fig. 9.2 The direction cosines ($\alpha$, $\beta$, $\gamma$) and ($\alpha'$, $\beta'$, $\gamma'$) represent the neutron flight direction
before and after the collision, respectively. The collision occurs at the point O', whereas $\vartheta$ is the
polar angle of the scattered neutron with respect to the initial direction

(d) Event storage: In the case of elastic scattering, the most relevant parameters of
the neutron trajectory are updated before returning to the previous point and
continuing to follow the particle path:

• The total distance $d_i$ travelled between the source and the point where the
last collision occurs (which we suppose to be the $i$-th one);

• The projections of $d_i$ on the Cartesian axes:

$$x_i = x_{i-1} + d_i\,\alpha \tag{9.12}$$
$$y_i = y_{i-1} + d_i\,\beta \tag{9.13}$$
$$z_i = z_{i-1} + d_i\,\gamma \tag{9.14}$$

(obviously $x_0 = y_0 = z_0 = 0$);

• The flight time $d_i/v$ ($v$ is the velocity modulus of the considered neutrons;
in our case $v = 2.2\cdot 10^5$ cm/s);

• The total number of collisions.

On the contrary, when the neutron is absorbed, after $k$ collisions, the quantity:

$$r = \sqrt{x_k^2 + y_k^2 + z_k^2} \tag{9.15}$$

is calculated, which is the distance between the source and the final absorption point
(see Fig. 9.2).
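The steps above can be sketched in a few lines of R. This is only an illustrative sketch, not the MCneutrons code: the function names `new_direction` and `simulate_neutron`, the symbol `Sigma_el` for the elastic term, the exponential sampling of the free path with mean $1/\Sigma_T$ and the isotropic emission from the source are assumptions made here. The direction is also renormalized after every collision, a purely numerical precaution.

```r
# Carbon parameters quoted in the text (cm^-1); Sigma_el is a name assumed here
A        <- 12
Sigma_el <- 0.3851
Sigma_a  <- 2.728e-4
Sigma_T  <- Sigma_el + Sigma_a
v        <- 2.2e5                        # neutron speed (cm/s)

# New direction cosines from the laboratory angles (mu = cos(theta), phi)
new_direction <- function(dir, mu, phi) {
  al <- dir[1]; be <- dir[2]; ga <- dir[3]
  s2 <- max(0, 1 - mu^2)                 # guard against rounding below zero
  if (abs(ga) < 1 - 1e-12) {             # general case, Eqs. (9.9)-(9.10)
    a <- sqrt(s2 / (1 - ga^2))
    c(mu * al + a * (al * ga * sin(phi) + be * cos(phi)),
      mu * be + a * (be * ga * sin(phi) - al * cos(phi)),
      mu * ga - a * (1 - ga^2) * sin(phi))
  } else {                               # flight along the z axis, Eq. (9.11)
    b <- sqrt(s2)
    c(ga * b * cos(phi), b * sin(phi), ga * mu)
  }
}

# One neutron history, from emission to absorption
simulate_neutron <- function() {
  pos <- c(0, 0, 0); path <- 0; ncoll <- 0
  g  <- 1 - 2 * runif(1); ph <- 2 * pi * runif(1)   # assumed isotropic source
  dir <- c(sqrt(1 - g^2) * cos(ph), sqrt(1 - g^2) * sin(ph), g)
  repeat {
    d    <- -log(runif(1)) / Sigma_T     # assumed exponential free path
    pos  <- pos + d * dir                # Eqs. (9.12)-(9.14)
    path <- path + d
    if (runif(1) <= Sigma_a / Sigma_T) break        # absorption ends the event
    ncoll  <- ncoll + 1                  # elastic scattering
    cos_cm <- 1 - 2 * runif(1)           # Eq. (9.7)
    phi    <- 2 * pi * runif(1)
    mu     <- (1 + A * cos_cm) / sqrt(A^2 + 2 * A * cos_cm + 1)   # Eq. (9.8)
    dir    <- new_direction(dir, mu, phi)
    dir    <- dir / sqrt(sum(dir^2))     # numerical renormalization
  }
  c(path = path, time = path / v, ncoll = ncoll, r = sqrt(sum(pos^2)))  # Eq. (9.15)
}

set.seed(1)
res <- t(replicate(100, simulate_neutron()))   # 100 histories, one per row
colMeans(res)
```

With only 100 histories the column means fluctuate by about 10%, so they should be compared with the histogram means of the full simulation only qualitatively.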

The procedure ends when the predetermined number of particles has been
generated. You can perform this simulation using our code MCneutrons. The
results provided by the programme, from 10,000 simulated events, are shown in
the histograms of Fig. 9.3 when the target nucleus is carbon ($A = 12$, $\Sigma_{el} = 0.3851$ cm$^{-1}$, $\Sigma_a = 2.728\cdot 10^{-4}$ cm$^{-1}$).

Fig. 9.3 Final histograms from the simulation of 10,000 neutrons in Carbon. (a) Zigzag distance (m): m = 36.83, s = 37. (b) Time (ms): m = 16.74, s = 16.82. (c) Number of collisions: m = 1419, s = 1425. (d) Flight distance (cm): m = 116.29, s = 82.79

From the analysis of the histogram distributions, we can both obtain the main 
parameters characterizing the neutron diffusion in carbon and also infer some 
important additional information on this process. 

Let us examine histogram (a) of Fig. 9.3 (total zigzag distance travelled by the neutron):
the error on the mean $m$ from the histogram is $s/\sqrt{10000} \simeq 36$ ($s$ is the histogram
standard deviation). Within the statistical error, there is then coincidence between
mean and standard deviation. This is an indication that the parent population is a
negative exponential density (see Sect. 3.3). It is easy to understand this feature if
we observe that the only quantity influencing the travelled path is $\Sigma_a$. Recalling
the same considerations that led us to Eq. (9.4), the initial probability density must
therefore be:

$$p(x) = \Sigma_a\, e^{-\Sigma_a x} , \tag{9.16}$$

and in fact, always within the statistical errors, the following identity is verified:

$$\mu = \sigma = \frac{1}{\Sigma_a} . \tag{9.17}$$

Similar considerations also apply to histogram (b), which represents the elapsed
time between emission and absorption. This quantity is nothing else than the ratio
between the total travelled distance $x$ and the velocity $v$ of the neutrons, so this
distribution has the same characteristics as the previous one. To explicitly obtain
the parent density $p(t)$ (with $t = x/v$) of this histogram, we just need to apply the
transformation law (5.7):

$$p(t) = \frac{dx}{dt}\, p(x) = v\,\Sigma_a\, e^{-\Sigma_a v t} . \tag{9.18}$$
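As a quick numerical check of (9.17) and (9.18), the expected means of histograms (a) and (b) can be computed directly from the carbon parameters quoted above. The variable names are chosen here for illustration; the histogram values are those printed in Fig. 9.3.

```r
Sigma_a <- 2.728e-4                 # absorption term for carbon (cm^-1)
v       <- 2.2e5                    # neutron speed (cm/s)

mean_x_m  <- 1 / Sigma_a / 100          # expected mean zigzag distance (m)
mean_t_ms <- 1 / (v * Sigma_a) * 1000   # expected mean flight time (ms)

err_x_m <- 37 / sqrt(10000)             # error on the mean of histogram (a), in m
pull_x  <- (36.83 - mean_x_m) / err_x_m # distance from the expectation, in error units

round(c(mean_x_m = mean_x_m, mean_t_ms = mean_t_ms, pull_x = pull_x), 3)
```

The expected means, about 36.66 m and 16.66 ms, agree with the histogram means 36.83 m and 16.74 ms within the statistical errors.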


Let us now consider the distribution of the number of collisions that each particle
had before absorption (histogram (c)). At first sight, taking into account that each
interaction can be considered a rare, uniform and stationary event, one might expect
this histogram to be Poisson distributed. However, also in this case, there is a
coincidence between the mean and standard deviation, a characteristic indication
of a negative exponential distribution.

Let us examine the situation in more detail. Contrary to what happens, for
example, in the flip of a coin, where the probability of having heads or tails is
not affected by previous tosses, the alternative between elastic scattering and absorption has an
effect on following interactions since, if absorption occurs, the neutron path ends,
and the possibility of having other collisions is thus excluded. On the basis of these
considerations, it is easy to realize that only the number of collisions per unit path
length is Poissonian, since this quantity only depends on the elastic scattering, a
process where each collision is certainly independent from the previous one. It
follows that the distribution of the total number
of collisions obeys the geometric law (3.7) with probability of success, i.e. of
absorption, $p = \Sigma_a/\Sigma_T$. As we noted in Sect. 3.7, this distribution, when $p \ll 1$
(as in our case), is practically indistinguishable from the exponential distribution
with parameter $p$ of Eq. (3.50); for this reason, always within the statistical errors,
the mean and standard deviation of the histogram are equal to $\Sigma_T/\Sigma_a$.
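The closeness of the geometric and exponential laws for small $p$ can be verified numerically. The sketch below uses the carbon values quoted in the text; note that R's `pgeom` counts the failures before the first success, so `pgeom(n, p)` is compared with the exponential cumulative evaluated at `n + 1`.

```r
p <- 2.728e-4 / (0.3851 + 2.728e-4)     # absorption probability per collision

mean_n <- 1 / p                         # mean number of collisions up to absorption
sd_n   <- sqrt(1 - p) / p               # standard deviation of the geometric law

n <- 0:20000
max_diff <- max(abs(pgeom(n, p) - pexp(n + 1, rate = p)))  # cumulative functions

c(mean_n = mean_n, sd_n = sd_n, max_diff = max_diff)
```

The mean is about 1413, to be compared with m = 1419 in histogram (c), and the two cumulative functions differ by less than $2\cdot 10^{-4}$ everywhere.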

Finally, to interpret histogram (d), we note that the flight distance (9.15) is the
modulus of a vector whose components $x_k$, $y_k$, $z_k$ are realizations of a sum of
independent random variables from populations with finite variance:

$$x_k = \sum_{i=1}^{N_{coll}} d_i\,\alpha_i , \qquad y_k = \sum_{i=1}^{N_{coll}} d_i\,\beta_i , \qquad z_k = \sum_{i=1}^{N_{coll}} d_i\,\gamma_i . \tag{9.19}$$

Due to the Central Limit Theorem, $x_k$, $y_k$ and $z_k$ are Gaussian variables; then, on
the basis of Pearson’s Theorem 3.3, the p.d.f. $p(r)$ of the flight distance $r$ would be expected to be
Maxwellian. However, in this specific case, a condition of this theorem is
not verified: although $d_i\alpha_i$, $d_i\beta_i$ and $d_i\gamma_i$ are independent and have finite variance, in
sums (9.19) the number of collisions per event $N_{coll}$ is a random variable. Then $p(r)$
can be considered as a superposition of different Maxwellian functions depending
on the $N_{coll}$ values. Variables of this type are known as stochastic sums; their
p.d.f.s are derived in some books, such as [PUP02]. We omit the derivation of the analytic
solution, known as Fick’s law, and report the final result [Lam66]:

$$p(r) = \frac{r}{L^2}\, e^{-r/L} , \qquad L = \sqrt{\frac{1}{3\,\Sigma_a \Sigma_T}} , \tag{9.20}$$

where the parameter $L$ (in our case $L = 56.3$ cm) is called diffusion length, whose
square is proportional to the mean square flight distance travelled by a neutron
from the source to its absorption point. In histogram (d) of Fig. 9.3 we have also drawn with a
line the function calculated by assuming the formula (9.20) as the model of the
parent population of the histogram and applying Eq. (6.98). The $\chi^2$ test to check
the consistency between sample and population, applied as explained in Sect. 7.5,
provides values $\chi^2_R \simeq 1$ and p-values much greater than 5%, indicating, as is also
evident from the figure, a good agreement between sample and population.
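The value of the diffusion length can be checked with a short calculation. The sketch below assumes that the density (9.20) has the form $p(r) = (r/L^2)\,e^{-r/L}$ with $L^2 = 1/(3\Sigma_a\Sigma_T)$, i.e. a gamma density with shape 2 and scale $L$; under this assumption the mean and standard deviation of $r$ are $2L$ and $\sqrt{2}\,L$.

```r
Sigma_a <- 2.728e-4                     # carbon values quoted in the text (cm^-1)
Sigma_T <- 0.3851 + 2.728e-4

L <- sqrt(1 / (3 * Sigma_a * Sigma_T))  # diffusion length (cm), assumed form of (9.20)

# Assumed density: p(r) = r / L^2 * exp(-r / L), a gamma law (shape 2, scale L)
p_r <- function(r) r / L^2 * exp(-r / L)
r_grid <- c(10, 50, 100, 300)
same_as_gamma <- isTRUE(all.equal(p_r(r_grid), dgamma(r_grid, shape = 2, scale = L)))

mean_r <- 2 * L                         # mean of the gamma law
sd_r   <- sqrt(2) * L                   # its standard deviation

round(c(L = L, mean_r = mean_r, sd_r = sd_r), 2)
```

The result $L \simeq 56.3$ cm reproduces the value quoted in the text, and the derived mean and standard deviation (about 112.6 cm and 79.6 cm) are within a few per cent of the values m = 116.29 and s = 82.79 of histogram (d).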


9.3 Simulation of Stochastic Processes 


The study of the time evolution of a stochastic process also becomes very
convenient when MC methods are used. As an example, we will consider a typical
operational research³ topic: the study of waiting phenomena.

Queues of people in front of a service desk are the best known example of the 
systems we are about to examine: from them the theory of waiting phenomena has 
taken its name and most of the terminology, but the classes of processes that can be 
analysed in this framework are very numerous, and many of them are also closely 
related to several moments of our daily life. The regulation of urban car traffic 
or of a train station, the scheduling of airline flights or the number of checkout 
counters in a supermarket and the operation of a warehouse or the maintenance 
system of a factory are just some of the problems that can be interpreted and solved 
through queuing theory. Its knowledge is therefore essential when dealing with 
methodologies related to industrial, economic or social management. 

In a schematic way, the structure of a waiting phenomenon includes a certain
number of “service stations” or “channels” (which can be clerks or workers assigned
to a certain task, communication lines, parking lots, etc.) carrying out the work
requested by “customers” (people waiting for a service, machines to be repaired,
goods to be shipped, etc.). The serving system can be in two different states when a
new customer arrives:

(a) There is at least one free station: the customer’s request is immediately taken
over, and this operation occupies one or more stations for a certain period of
time.

(b) All stations are busy: the customer waiting to be “served” is put in a “queue”,
whose characteristics are very different depending on the considered system.

³ Operational research is a discipline that applies mathematical methods to the study and analysis
of problems involving complex systems in order to find suitable solutions for their optimal
organization.


The purpose of this study is to evaluate the overall performance of the production 
process taking into account the costs related to both customer waiting times and 
the number or periods of inactivity of the serving stations (for a more complete 
discussion, see [CS61]). 

The most complicated part in simulating this type of process is to correctly define
a “clock” that reproduces the continuous event timeline. This problem can be solved
in two different ways, which in the literature are called synchronous or continuous
simulation and asynchronous or discrete simulation [BFS87].

In the synchronous simulation, the clock problem is solved in an extremely
simple way: a unit of time (second, minute, hour, etc.) is defined in an arbitrary
way, as long as it is small compared to the characteristic time scales of the system,
and at each programme cycle, the clock is advanced by one unit. Simulation is thus
discrete, but it is also possible to simulate a continuous process as long as the unit of time
increment in a cycle is small compared to the transition times of the system.

Then, knowing the probabilities per unit time $0 \le \lambda_i \le 1$ (in the chosen unit) of each
possible event or change of state of the system, a set of
uniform random variates $0 \le \xi \le 1$ is extracted, and the event occurs if $\xi \le \lambda_i$;
otherwise the system is left unchanged. For each predetermined time interval $\Delta t$,
the temporal averages and the standard deviations of the characteristic quantities
describing the system are then calculated, and, if necessary, graphs and histograms
are created at the end of the simulation. As an example, Fig. 9.4 shows how to
simulate the failure of a machine with a mean of five failures per hour, choosing
the second as a time unit.


Fig. 9.4 Synchronous simulation of an event with a five event/hour probability. At every second a uniform random number (RANDOM) is drawn between 0 and 1: if it falls below 5/3600 (5 events per hour), an event occurs and the state changes; otherwise the system does not change. In both cases the clock advances, t = t + 1


Fig. 9.5 Histograms of the average number of events per hour (axis: events/hour) and of the time gaps between two
successive events (axis: seconds between events) generated with the synchronous simulation of Fig. 9.4. The solid line shown in
the time histogram is the fit of the data with the negative exponential law


Let us now ask ourselves the fundamental question: is the synchronous simulation
in agreement with the general law of independent stochastic phenomena, which
predicts a Poissonian event distribution and adjacent events separated by exponential
time intervals? The answer is yes, as long as the condition $\lambda_i \ll 1$ holds. Indeed,
in this case the probability of having $n$ elementary time instants between adjacent
events is given by the geometric density (3.50), which, as repeatedly noted, becomes
an excellent approximation of the exponential density (3.50) with parameter $p$ when
$p \ll 1$. In support of this assertion, we show in Fig. 9.5 the histograms of the
simulation, for 10,000 h, of the number of service requests per hour and the number
of seconds between two adjacent events for the process displayed in Fig. 9.4, where
$\lambda = 5/3600$ s$^{-1}$: as you can see, the number of events per hour is Poissonian, and
the time gaps between two events follow the negative exponential density. Note that,
even with an average time between two events of about 3600/5 = 720 s (12 min),
you can have waiting times as long as two hours! This is one of the reasons for the
large fluctuations in temporal averages that often occur in stochastic phenomena, a
feature that simulation is able to accurately reproduce.
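The synchronous scheme of Fig. 9.4 can be reproduced with a few vectorized R instructions. The following is a minimal sketch, not the code used for Fig. 9.5: it simulates 1000 h instead of 10,000 h to limit memory use, and the variable names are chosen here for illustration.

```r
set.seed(1)
lambda  <- 5 / 3600                 # event probability per second
n_hours <- 1000                     # simulated hours (fewer than in Fig. 9.5)

# One uniform variate per second: an event occurs when it falls below lambda
hit <- runif(n_hours * 3600) <= lambda

per_hour <- colSums(matrix(hit, nrow = 3600))  # events in each hour (one column per hour)
gaps     <- diff(which(hit))                   # seconds between adjacent events

c(mean = mean(per_hour), var = var(per_hour))  # Poisson law: mean close to variance
c(mean = mean(gaps), sd = sd(gaps))            # exponential law: mean close to sd
max(gaps) / 3600                               # longest gap, in hours
```

For a Poissonian count the mean and the variance should both be close to 5, while for exponential gaps the mean and the standard deviation should both be close to 720 s.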

We now come to the other type of simulation, the asynchronous or discrete one 
[Bun86, Ros96], often referred to simply as Monte Carlo simulation. In this case, 
one exploits the fact that the system changes its state only when very specific events 
take place (e.g. the arrival of a new customer or the end of a station’s service), 
otherwise remaining substantially unchanged. 

Thus the time of the simulation clock does not advance regularly but at variable
intervals, obtained through Eq. (3.90), each of which marks the arrival of a new event.
At this time all the indicators describing the state of the system to be studied
are updated (let us assume again for simplicity that each system change occurs
immediately). The time instant at which a new event occurs is determined by sorting
a “list”, where all the possible types of events that can happen in the system, as well
as their instants of occurrence, must be recorded.
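The mechanism of the event list can be illustrated with a minimal sketch. Two independent event types are assumed here, the first with the rate of Fig. 9.4 and the second with a hypothetical rate of 2 events per hour; the next occurrence time of each type is kept in the vector `sched`, and exponential gaps are drawn with `rexp`, which we assume to be equivalent to the sampling of Eq. (3.90).

```r
set.seed(2)
rates <- c(5, 2) / 3600             # events per second; the second rate is hypothetical
t_max <- 500 * 3600                 # simulated time (s)

sched <- rexp(length(rates), rate = rates)   # event list: next time of each type
t     <- 0
count <- c(0, 0)

while (TRUE) {
  k <- which.min(sched)             # the earliest scheduled event comes next
  if (sched[k] > t_max) break       # nothing else happens before t_max
  t <- sched[k]                     # the clock jumps directly to the event
  count[k] <- count[k] + 1          # change of state (here: just counting)
  sched[k] <- t + rexp(1, rate = rates[k])   # schedule the next event of this type
}

count / 500                         # events per hour for the two types
```

Only a few thousand cycles are needed here, against the 1.8 million one-second steps that a synchronous simulation of the same 500 h would require.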


Fig. 9.6 Flow diagrams for the synchronous and asynchronous simulations of waiting phenomena. Synchronous simulation: parameter definition; t = 0; while t < t_max, the clock advances by one unit (t = t + 1), a change of state is applied if an event occurs and the averages within Δt are updated; otherwise stop. Asynchronous simulation: parameter definition; t = t_0; while t < t_max, the time t of the next event is found, the means within (t − t_0) are updated, the change of state is applied and the state is stored; otherwise stop


Generally, synchronous simulation results in computational codes much simpler
than those using the asynchronous one, since the latter must ensure the correct
ordering of arrival times. This task is often difficult, especially if the system is complicated.
However, asynchronous simulation is sometimes definitely preferable, as
the computational time required by synchronous simulation, which runs a fixed
cycle even when the system remains unchanged, can be unacceptable. Typical flow
diagrams for synchronous and asynchronous simulations are displayed in Fig. 9.6.

The goal of this type of simulation is usually the determination of the averages
of some characteristic quantities within predetermined time intervals. To fix ideas,
we might be interested in the number of customers waiting at a gas station or in
a supermarket checkout line, averaged within an hour (taken as a unit of time). To
obtain these averaged quantities, it is necessary to record the time instants $t_i$ of
the system modifications (e.g. when a new customer is added to the queue) and to
calculate, for each variation, the quantity:

$$x_i\,(t_i - t_{i-1}) = x_i\,\Delta t_i , \tag{9.21}$$

where $x_i$ is the value of the variable (discrete or continuous) before the variation at
$t_i$.

If we observe the process for a long time of $t$ hours (as an example, for a day),
we can define an average quantity as:

$$m_t = \frac{\sum_i x_i\,\Delta t_i}{t} . \tag{9.22}$$


If we divide the measurement period into rather small and equal intervals (as in the
synchronous simulation) $\Delta t_i = \delta_i$, so as to contain at most one change of state,
and we assign to each $\delta_i$ the value $x_i$ of the last variation, this formula is nothing
more than the ordinary mean of a sample of $n = t/\delta_i$ events.

The mean and variance of $x$ can be obtained using a simulated sampling on $N$
cycles of duration $t$ (e.g. several days, a week or a month) and applying the standard
statistical formulae:

$$\langle x \rangle = \frac{1}{N}\sum_{i=1}^{N} (m_t)_i , \qquad \langle x^2 \rangle = \frac{1}{N}\sum_{i=1}^{N} (m_t)_i^2 , \tag{9.23}$$

$$s^2(x) = \frac{N}{N-1}\left[\langle x^2 \rangle - \langle x \rangle^2\right] . \tag{9.24}$$

In these processes, the estimate of the variance of $x$ must be evaluated with Eq. (9.24),
by progressively computing the partial means, since the final average value will be
known only at the end of simulation.
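Formulae (9.21)-(9.24) translate into two short R functions. The names `time_average` and `cycle_stats` are chosen here for illustration; the first function assumes that the last recorded instant coincides with the end of the observation period.

```r
# Time-weighted average (9.22): t_change are the instants t_i of the variations,
# x_before the values x_i held by the variable before each variation
time_average <- function(t_change, x_before, t_total) {
  dt <- diff(c(0, t_change))        # intervals Delta t_i, Eq. (9.21)
  sum(x_before * dt) / t_total
}

# Mean and variance over N cycles, Eqs. (9.23) and (9.24)
cycle_stats <- function(m_t) {
  N       <- length(m_t)
  mean_x  <- sum(m_t) / N
  mean_x2 <- sum(m_t^2) / N
  c(mean = mean_x, var = N / (N - 1) * (mean_x2 - mean_x^2))
}

# A queue that is empty up to t = 0.2 h, holds 1 customer up to 0.5 h and 2 up to 1 h
m <- time_average(c(0.2, 0.5, 1.0), c(0, 1, 2), t_total = 1)
m                                   # (0 * 0.2 + 1 * 0.3 + 2 * 0.5) / 1 = 1.3

cycle_stats(c(1, 2, 4))             # the variance agrees with var(c(1, 2, 4))
```

Since (9.24) only needs the running sums of $(m_t)_i$ and $(m_t)_i^2$, both can be accumulated cycle by cycle without storing the individual values.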

Notice that fluctuations on the final result now depend not only on the statistical 
error but also on the variability over time of the parameters that describe the state 
of the process under study. However, systems of this type almost always reach,
after a certain operational time, a steady state in which the quantities (9.23, 9.24)
no longer have appreciable variations in time. Determining when the steady state
is reached is not generally an analytically solvable problem, since it depends on 
the particular characteristics of each individual process, such as the intensity of 
customer traffic, the shape of the distribution of arrivals and service times. For this 
reason, the results are usually printed at regular intervals of simulated time in such a 
way as to empirically observe the variations and the convergence to the equilibrium 
value of the state variables of the system. 

Even the evaluation of the statistical error is quite complicated since all the events
are correlated (the waiting time of a generic customer depends, e.g., on those of the
immediately preceding customers), so that the use of formulae valid for independent
observations would lead to wrong results.

A general result of statistics, which remains valid, is that the error of the mean
$\langle x \rangle$, in the case of steady state, decreases as the inverse square root of the number of
sampling cycles, which is proportional to the total simulated time $T_{sim}$:

$$\sigma_{MC} \propto \frac{1}{\sqrt{T_{sim}}} . \tag{9.25}$$


One of the most used algorithms for the statistical error calculation is the batch
means method, which will be briefly discussed in Sect. 9.7. Alternatively, two
simple procedures can be applied:

• Once the simulated time interval of the process under study has been determined,
the entire programme is repeated, with different sets of random numbers, a
number of times (at least 15 or 20) large enough to apply the Central Limit
Theorem for the calculation of the error on the mean of the obtained results.

• On the other hand, if one wishes to find the minimum simulated time interval
necessary to obtain a predetermined precision on the final results, it is necessary
to carry out a short preliminary test to obtain the error on the result for a small
value of $T_{sim}$. The requested value is then easily obtained by exploiting the
proportionality law (9.25) on the statistical error.
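Both procedures reduce to one-line calculations. In the sketch below the function names and the numerical values are hypothetical, chosen only to show the arithmetic.

```r
# First procedure: error on the mean from independent repetitions of the programme
error_on_mean <- function(results) sd(results) / sqrt(length(results))

# Second procedure: simulated time needed to reach the error sigma_target, given a
# preliminary run of duration T0 with error sigma0 (sigma proportional to 1/sqrt(T))
required_time <- function(T0, sigma0, sigma_target) T0 * (sigma0 / sigma_target)^2

error_on_mean(c(1, 2, 3, 4, 5))     # sqrt(2.5) / sqrt(5), about 0.71
required_time(10, 0.2, 0.05)        # a 4 times smaller error needs a 16 times longer run
```

The quadratic dependence is the practical consequence of (9.25): halving the statistical error always costs four times the simulated time.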


9.4 Number of Workers in a Plant: Synchronous Simulation 


To better exemplify and clarify all the previous considerations, we solve, using 
synchronous simulation, the problem of determining the optimal number of workers 
to be assigned to the control and maintenance of a certain number of industrial 
machines. 

Suppose that the main characteristics of the system to be studied, deduced from 
empirical observations, are the following: 


• In the plant there are ten machines, each of which requires on average 3 interventions
per hour, for both failure repair and the standard operational work
phase.

• The duration of each intervention (which requires the activity of only one
person) follows a negative exponential law with an average of two interventions
completed in 1 h by each worker, regardless of the person who carries it out and
of the machine requiring it.

• The hourly cost linked to the inactivity of a worker ($c_1$) is estimated at 70 euros,
while that linked to the inactivity of a machine ($c_2$) is 20 euros.


First of all, it is necessary to begin with a concise (but complete!) description of
the system. We remind you that, on average, there are three intervention requests
per hour per machine and that, on average, each worker repairs or maintains two
machines per hour. The objective of the optimization is to minimize the cost of
the system, knowing that a machine waiting for intervention costs 20 euros/hour
and an inactive operator costs 70 euros/hour. The simulation will therefore have to
determine the average number of inactive operators ($N_{oi}$) and of waiting machines
($N_{mi}$) and minimize the average hourly cost:

$$C = 70\,\langle N_{oi} \rangle + 20\,\langle N_{mi} \rangle . \tag{9.26}$$


The short description of the system is reported in Table 9.1. 

Table 9.1 Logical description used for the simulation of the system “number of workers in a
plant”. The status variable of a machine can take on three values: running (0), under repair (1),
awaiting repair (2)

Machine status        Change of state         Inactive workers   Variation inact. workers   New machine status
Running (0)           Intervention needed     > 0                −1                         1
Running (0)           Intervention needed     0                  0                          2
Under repair (1)      Intervention finished   any                +1                         0
Awaiting repair (2)   Intervention started    > 0                −1                         1
Awaiting repair (2)   No action               0                  0                          2

The second step is the execution of the routine MCsinc, which is partially
reported below. The chosen time unit of measurement is the second, and, therefore,
the probabilities of intervention request and end of repair are, respectively, 3/3600
and 2/3600. At each second, the programme examines the state of the machines
(initially all running, variable Ms[k] = 0). If a particular machine works, it requires
an intervention if the uniform random variable $\xi \le 3/3600$; if instead the machine is being
serviced by a technician, an end of service occurs if $\xi \le 2/3600$. Finally, if the
machine is waiting for intervention and there are free technicians, it is served
(Ms[k] = 1); otherwise it remains waiting (Ms[k] = 2). This is the phase of state
change, indicated in the second column of Table 9.1 and also in the block diagram
of Fig. 9.6. The part of the routine that performs the status sampling is as follows:
# Set-up values assumed here so that the excerpt runs on its own
H  = 24                            # hours to be simulated
Mn = 10                            # number of machines
On = 5                             # number of workers
Ms = rep(0, Mn)                    # machine status: all running at the start

Hour = 60                          # unit of measure: one hour = 60 minutes
Minuts = Hour*H                    # minutes of the simulation
Mp = 3/Hour                        # 3 requests of intervention per hour
Op = 2/Hour                        # 2 ends of intervention per hour
Ol = On                            # Ol = number of workers in stand-by
Oi = 0                             # Oi = cumulative number of workers in stand-by
Mi = 0                             # Mi = cumulative number of machines waiting for an intervention
# simulation for H hours in steps of 3600/Hour seconds
for(n in 1:Minuts) {               # beginning of general loop
  for(k in 1:Mn) {                 # scan the machines
    if(Ms[k] == 0) {               # machine k is running
      if(runif(1) < Mp) {          # request of intervention
        if(Ol > 0) {               # there are workers in stand-by
          Ms[k] = 1                # machine number k under repair
          Ol = Ol - 1              # update the number of workers in stand-by
        }
        else { Ms[k] = 2 }         # machine number k waiting for repair
      }
    }
    else if(Ms[k] == 1) {          # machine k under repair
      if(runif(1) < Op) {          # end of intervention?
        Ms[k] = 0                  # machine repaired
        Ol = Ol + 1                # update the number of free workers
      }
    }
    else if(Ms[k] == 2) {          # machine k is waiting for repair
      if(Ol != 0) {                # are there free workers?
        Ms[k] = 1                  # machine k under repair
        Ol = Ol - 1                # update the number of workers in stand-by
      }
    }
  }
  for(k in 1:Mn) {                 # number of machines waiting for repair
    if(Ms[k] == 2) { Mi = Mi + 1 }
  }
  Oi = Oi + Ol                     # cumulative number of workers in stand-by
}                                  # end of general loop


The number of inactive operators Oi and the number of machines awaiting interven- 
tion Mi, which are the variables to be considered, are updated every second. These 
quantities are averaged every 24h according to Eq. (9.22), and the corresponding 
means and variances are progressively calculated with Eqs. (9.23, 9.24). Figure 9.7 
shows the trend of Oi and Mi for a simulated observation period of 500 days. 

The means and the final variances of these quantities are reported in Table 9.2,
with the evaluation of the overall cost through Eq. (9.26).

Fig. 9.7 The lower plots show the daily trend of the number of machines awaiting service and the
number of inactive workers per hour for the “optimal number of workers in a plant” problem, when
five persons are employed; the projection on the ordinates of the daily values is shown in the upper
histograms (axes: inactive workers/day, waiting machines/day). The mean m and the m ± s values of the top histograms are the solid and dashed lines
of the bottom plots (axis: days)

Table 9.2 Changes in the global cost of the plant with the number of the employed workers. The
errors are the sample standard deviations, not errors on the mean

Number of   Inactive workers/hour         Waiting machines/hour    Global cost
workers     (⟨N_oi⟩ ± s_oi)               (⟨N_mi⟩ ± s_mi)          per hour
3           1.5·10⁻³ ± 3·10⁻³             5.00 ± 0.21              100.1
4           2.21·10⁻² ± 1.84·10⁻²         3.38 ± 0.26              69.1
5           0.148 ± 0.055                 1.91 ± 0.23              48.6
6           0.53 ± 0.12                   0.87 ± 0.15              52.4

An examination of the
figure and of the table shows considerable fluctuations around the average values. In
fact, the plant is not very efficient, because, given ten machines with an average of
three requests for intervention per hour and an average service time of half an hour,
there will always be a large number of inactive machines (about five or six) even
with ten workers, one per machine. Moreover, given the exponential law assumed
for both intervention and service request times, which are also comparable with
each other, it is natural to expect large fluctuations in the daily quantities observed.
In any case, since these features are given a priori, the simulation still solves the cost
minimization problem, showing that the optimal number of workers is five, as shown
in Table 9.2. In Fig. 9.8, the average number of inactive workers and the number of
machines awaiting repair for each hour of operation of the plant are reported. The
fluctuations present at the beginning of the simulation are due both to the specific
initial conditions assumed for the system, far from the stationary regime, and to the
insufficient quantity of data. From this figure, it can also be deduced that stationarity
is reached quite quickly.
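The cost column of Table 9.2 can be recomputed from the rounded averages with Eq. (9.26). The vectors below simply copy the printed table entries; the names are chosen here for illustration.

```r
workers <- 3:6
N_oi <- c(1.5e-3, 2.21e-2, 0.148, 0.53)   # inactive workers per hour (Table 9.2)
N_mi <- c(5.00, 3.38, 1.91, 0.87)         # waiting machines per hour (Table 9.2)

cost <- 70 * N_oi + 20 * N_mi             # Eq. (9.26), euros per hour
round(cost, 1)
best <- workers[which.min(cost)]          # number of workers giving the minimum cost
best
```

For three, four and five workers the rounded entries reproduce the quoted costs (100.1, 69.1 and 48.6); for six workers they give 54.5 instead of the quoted 52.4, a difference that does not affect the position of the minimum at five workers.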

It should also be noted that the means and standard deviations of Table 9.2 are
derived from a 500-day observation. Considering as an example the optimal case of
five workers, from the formulae of Table 6.3, we can estimate the errors on these
quantities, using Eq. (9.25), as:

$$\sigma[\langle N_{mi} \rangle] = \frac{1.91}{\sqrt{500}} \simeq 0.08 , \qquad \sigma[s(N_{mi})] = \frac{0.23}{\sqrt{1000}} \simeq 0.007 ,$$

$$\sigma[\langle N_{oi} \rangle] = \frac{0.148}{\sqrt{500}} \simeq 0.007 , \qquad \sigma[s(N_{oi})] = \frac{0.12}{\sqrt{1000}} \simeq 0.004 .$$


9.5 Number of Workers in a Plant: Asynchronous Simulation 


The synchronous simulation has the disadvantage of carrying out many unnecessary
cycles, in which the system does not change state. To speed up the synchronous
code, we changed the unit of time from second to minute, obtaining results different
only by a few percent from those of Table 9.2. Furthermore, in this scheme, only
exponential time distributions for intervention and repair are allowed. A more
general and quicker way to describe the system is to use asynchronous simulation.

Fig. 9.8 Variation, as a function of the simulation time (hours), of the average number of inactive workers
(a) and of the fraction of machines awaiting repair (b) for each hour of plant operation and in the
case of six workers

In this case, time does not flow at regular intervals but in uneven jumps and
only at event occurrence times. In the considered system, there are two types of
events: request for intervention and end of intervention. In the simulation code
MCasinc, it is then necessary to define the vectors Tmac[k] and Top[k]
which contain, for each machine, the intervention request and end-of-service times,
respectively, evaluated with the R rexp routine by using the exponential law.
During the initialization phase, a series of request times are extracted while the
end-of-service times are “frozen” with an “infinite” time BIG. The core of MCasinc
which generates the simulated events is the following:


# Set-up values assumed here so that the excerpt runs on its own
Mn = 10; On = 5                    # number of machines and of workers
Mp = 3;  Op = 2                    # request and end-of-service rates (per hour)
BIG = 1.e30                        # "infinite" time
TIME = 24                          # simulated time (hours)
Ms = rep(0, Mn)                    # machine status: all running at the start
Tmac = numeric(Mn); Top = numeric(Mn)
occ = 0; t = 0; oinh = 0; minh = 0

for(k in 1:Mn) {
  Tmac[k] = rexp(1, rate=Mp)       # exponential intervention request
  Top[k] = BIG                     # all workers are available
}

# Days of simulations
while(t <= TIME) {
  # awaiting machines Mi and free workers Ol in (t-tprec)
  tprec = t
  Mi = 0
  for(j in 1:Mn) { if(Ms[j] == 2) { Mi = Mi + 1 } }
  Ol = On - occ
  # search for indices of the minimum time and of the state change
  kmac = which.min(Tmac)
  kop = which.min(Top)
  if(Tmac[kmac] < Top[kop]) {      # one machine is out
    t = Tmac[kmac]
    if(occ < On) {                 # there are free workers
      occ = occ + 1
      Top[kmac] = t + rexp(1, rate=Op)
      Ms[kmac] = 1
      Tmac[kmac] = BIG
    }
    else {                         # awaiting machine
      Ms[kmac] = 2
      Tmac[kmac] = BIG
    }
  }
  else {                           # one worker is free
    t = Top[kop]
    occ = occ - 1
    Ms[kop] = 0
    Tmac[kop] = t + rexp(1, rate=Mp)
    Top[kop] = BIG
    n = 1
    flag = 0
    while(flag == 0 && n <= Mn) {  # use a free worker
      if(Ms[n] == 2) {             # if there are awaiting machines
        Ms[n] = 1
        occ = occ + 1
        Top[n] = t + rexp(1, rate=Op)
        flag = 1
      }
      n = n + 1
    }
  }
  oinh = oinh + Ol*(t-tprec)       # daily data of Ol and Mi
  minh = minh + Mi*(t-tprec)       # moving means
}


The simulation clock moves forward by searching for the minimum of the time
instants contained in these two vectors, using the which.min routine. If a machine
is waiting for repair or a worker becomes inactive, the request or end-of-service
times must be stopped by introducing into the Tmac and Top vectors the time BIG.

Now suppose that we have already considered a certain number of events; the
next one will be either a new request for intervention or the end of a worker’s
service. If the event is a request for intervention on the machine kmac, the
simulated time is updated, an operator is appointed (if available), the machine
“is frozen” (Tmac[kmac] = BIG) and the end of intervention time is simulated
(Top[kmac] = t + rexp(1, rate=Op)); if there are no free operators, the machine
is set in state 2 (Ms[kmac] = 2). If, on the other hand, the event is the end of
the operator’s service on the machine kop, a new request of intervention time
is generated (Tmac[kop] = t + rexp(1, rate=Mp)), the machine repair time is frozen
(Top[kop] = BIG) and the vectors and status indices are updated, as commented
in the programme. If there are other machines in the queue, the operator who has
just finished his job immediately takes care of a waiting machine (the number n),
with an intervention that will end at the time Top[n] = t + rexp(1, rate=Op). The
number of waiting machines Mi and the number of free operators Ol are multiplied
by the time between two events (t-tprec), obtaining the new variables oinh
and minh. This is an important point: this time interval multiplies Ol and Mi set at
the previous time tprec, since between tprec and t, the system state has not
varied after the change at tprec. This part of the code is equivalent to that of the
synchronous simulation where the same quantities were summed up every second,
even if unchanged. The daily averages of (9.22) are then updated by dividing oinh
and minh by the 24 h interval. The moving averages and the relative variances are
then calculated as in the case of the synchronous simulation. The simulation ends
when the time t reaches a predetermined value.

As can be seen, the logic and structure of the code are more complicated than
in the case of synchronous simulation. However, the execution time was about 15
times shorter, because now the system state transition is only calculated when an
event actually occurs. The code gives the same results as the synchronous simulation
that we have already reported in Table 9.2 and Fig. 9.7.

For simple programmes like these, the time gain is not important, but for complex 
models, in which the execution time of the programmes can be even in the order of 
hours, a gain of this magnitude is decisive. 

You can now modify the code and complicate its structure, trying to describe 
a more realistic model by directly exploring how it is possible to study complex 
systems in a very simple way. For example, different types of service requests 
(breakdown, maintenance, etc.) can be introduced, with different intervention times, 
the availability of spare parts could be considered, different repair times can be 
assigned to each worker and so on. The inclusion of these details in an analytical
model quickly makes the problem intractable, while, with a simulation code, it is
possible to add new details in a modular way without complicating either the structure
or the management of the model.


9.6 Kolmogorov-Smirnov Test 


Here we complete the hypothesis testing topic, begun in Chap. 7, with the
Kolmogorov-Smirnov (KS) test, which gains simplicity and clarity if it is explained
with the support of simulation techniques. The KS test is non-parametric and holds
for continuous variables.

Given a sample $x_i$ (with $i = 1, 2, \ldots, n$) of n values of the variable X sorted in
ascending order, an empirical cumulative function is defined as:


$$
F_n(x) = \begin{cases}
0 & \text{if } x < x_1 , \\
\dfrac{k}{n} & \text{if } x_k \le x < x_{k+1} , \\
1 & \text{if } x \ge x_n .
\end{cases} \tag{9.27}
$$


[Figure: "cumulative distributions"]
Fig. 9.9 In the Kolmogorov-Smirnov test a sampled cumulative distribution (step function) is
compared with the expected distribution F(x) (continuous curve). D is the greatest distance
between the two distributions

This function is constant between two consecutive points and increases discontinuously
by 1/n at every point. The function $F_n(x)$ is an unbiased estimator of the
cumulative function F(x), as we have already shown in Eq. (8.89). Furthermore,
each point of the cumulative is strongly correlated to the other points, being their
running sum. The D estimator of KS is based on these principles and is simply:


$$
D_n = \sup_{-\infty < x < +\infty} |F_n(x) - F(x)| . \tag{9.28}
$$


To find the maximum of this difference, we need to explore all the x values of the
density support, as shown in Fig. 9.9.
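
As a minimal sketch of Eq. (9.28), the code below computes D for a simulated Gaussian sample and compares it with the statistic returned by `ks.test`. It relies on the fact that $F_n$ is a step function and F is non-decreasing, so the supremum is attained at a jump of $F_n$, just before or just after it. The sample size and the seed are arbitrary choices.

```r
set.seed(1)
x  <- sort(rnorm(30))            # sample in ascending order
n  <- length(x)
Fx <- pnorm(x)                   # model cumulative F at the sample points

# F_n jumps from (k-1)/n to k/n at x_k: check both sides of every jump
d_plus  <- max((1:n) / n - Fx)         # F_n above F, just after the jump
d_minus <- max(Fx - (0:(n - 1)) / n)   # F above F_n, just before the jump
D <- max(d_plus, d_minus)

D_ref <- unname(ks.test(x, "pnorm")$statistic)
```

The two values `D` and `D_ref` should coincide up to rounding.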

Let us briefly recall the properties of this estimator. First of all, it can be shown
that $F_n(x)$ converges almost surely to F(x) (see Eq. 2.74) [MGB73]. This property
is known as the Glivenko-Cantelli theorem.

The second very important property is the independence of the D distribution
from F(x). This follows directly from the cumulative variable theorem 3.5,
according to which we know that if X ~ p(x) and F(x) is the cumulative of p(x),
then F(X) ~ U(0, 1). We can then write the relation:


$$
F_n(x) = \frac{\#(j : X_j \le x)}{n} = \frac{\#(j : F(X_j) \le F(x))}{n} = \frac{\#(j : U_j \le F(x))}{n}
$$
that, in words, means: the fraction of sample values not exceeding a certain value x is
equal to the fraction of values $F(X_j) \le F(x)$, since F is a monotonically increasing
function. We also know that $F(X_j) = U_j \sim U(0, 1)$, hence the last relation. Based
on this property we can write the KS estimator as:


$$
D_n = \sup_{-\infty < x < +\infty} \left| \frac{\#(j : U_j \le F(x))}{n} - F(x) \right|
    = \sup_{0 \le u \le 1} \left| \frac{\#(j : U_j \le u)}{n} - u \right| , \tag{9.29}
$$


from which it follows that the distribution of D can be obtained by randomly
extracting n uniform variates and finding the value of the maximum difference
between the fraction $\#(U_j \le u)/n$ and u for $u \in [0, 1]$. We therefore deduce that
the distribution of $D_n$ is non-parametric, universal and depends only on n.
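
The following sketch illustrates property (9.29) numerically. The helper `ks_stat` is a name introduced here, not one of our routines: it computes $D_n$ for values already mapped to (0, 1). A Gaussian and an exponential sample are built from the same uniform variates by inversion, so that the three values of D must coincide; a simulation with uniforms only then gives the reference distribution for n = 25.

```r
# D_n for values u_j = F(x_j) already mapped to the unit interval (hypothetical helper)
ks_stat <- function(u) {
  u <- sort(u)
  n <- length(u)
  max((1:n) / n - u, u - (0:(n - 1)) / n)
}

set.seed(2)
n <- 25
u <- runif(n)
x_norm <- qnorm(u, mean = 3, sd = 2)   # Gaussian sample built from the same uniforms
x_exp  <- qexp(u, rate = 0.5)          # exponential sample built from the same uniforms

D_unif <- ks_stat(u)
D_norm <- ks_stat(pnorm(x_norm, mean = 3, sd = 2))
D_exp  <- ks_stat(pexp(x_exp, rate = 0.5))

# Reference distribution of D_n from uniform variates only
D_sim <- replicate(5000, ks_stat(runif(n)))
q90   <- unname(quantile(D_sim, 0.90))   # 90% quantile, i.e. p = 0.10
```

Since the parent density never enters `D_sim`, the same simulated distribution serves for any continuous F.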

The third property is related to the p.d.f. of the D statistic, whose properties have
to be known for the p-value calculation to be used in the test. In fact, the concept
at the base of the test is that the distance between $F_n(x)$ and F(x) follows the KS
distribution (and will not give small p-values) when F(x) is the true parent
cumulative distribution of the data. The
determination of the true form of this p.d.f. is still an open problem, but there are
many empirical solutions. The first of them was proposed in 1933 by Kolmogorov
himself [Kol33] during a stay in Rome. He noticed, in a brilliant way, that the
fluctuations of $F_n(x)$ around F(x) are the same as in certain types of Brownian
motion around zero. Fortunately, as we have just seen, the distribution is universal
and depends only on n, so it can be found empirically. In [PFTW92] the formula for
the p-value calculation corresponding to an observed value d is reported as:


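$$
P\{D_n > d\} = Q_K(\lambda) = 2 \sum_{j=1}^{\infty} (-1)^{j-1} \exp(-2 j^2 \lambda^2) , \tag{9.30}
$$

$$
\lambda = \left( \sqrt{n} + 0.12 + \frac{0.11}{\sqrt{n}} \right) d .
$$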
This approximation gives good results already from n > 4. 

In R, this test can be performed using the ks.test routine, which requests
the data vector x and the name of the distribution function as input. Our
routine MCKolmoDist(nsim, ndata, type) simulates nsim variates of
$D_n$ from a sample of ndata data, Gaussian (type='pnorm') or uniform
(type='punif'). The results are shown in Fig. 9.10, from which we can see the
goodness of the approximation given by Eq. (9.30).
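
A minimal sketch of Eq. (9.30) follows. The function names `q_ks` and `ks_pvalue` are introduced here for illustration and are not part of our routines; the infinite series is truncated at 100 terms and the result is clipped to [0, 1], which is an assumption adequate for values of $\lambda$ that are not extremely small.

```r
# Truncated series of Eq. (9.30); 'terms' is an arbitrary truncation point
q_ks <- function(lambda, terms = 100) {
  j <- 1:terms
  p <- 2 * sum((-1)^(j - 1) * exp(-2 * j^2 * lambda^2))
  min(max(p, 0), 1)                       # clip rounding/truncation excursions
}

# p-value for an observed distance d with n data
ks_pvalue <- function(d, n) {
  q_ks((sqrt(n) + 0.12 + 0.11 / sqrt(n)) * d)
}

set.seed(3)
n   <- 25
x   <- rnorm(n)
fit <- ks.test(x, "pnorm")                # R's own computation, for comparison
p_approx <- ks_pvalue(unname(fit$statistic), n)
p_R      <- fit$p.value
```

The values `p_approx` and `p_R` can be printed side by side to judge the approximation for the chosen n.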

In addition to the test between a sample and a distribution, it is also possible
to compare two samples, because in this case too the KS test maintains the
fundamental property (9.29). The changes to be made to the reference distribution
under $H_0$ (the two samples come from the same population) are minimal: if $n_1$ and
$n_2$ are the sizes of the two samples, just set:

$$
n = \frac{n_1 n_2}{n_1 + n_2} \tag{9.31}
$$

in Eq. (9.30). The R routine ks.test also performs this test if the two experimental
samples are given in the calling string.
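
The two-sample case can be sketched as follows, under the assumption that the maximum distance is taken between the two empirical cumulative functions evaluated at all pooled data points. The sample sizes, the shift of 0.3 and the seed are arbitrary; the p-value uses Eq. (9.30) with the effective n of Eq. (9.31), truncating the series at 100 terms.

```r
set.seed(4)
x  <- rnorm(40)
y  <- rnorm(60, mean = 0.3)
n1 <- length(x)
n2 <- length(y)

# Maximum distance between the two empirical cumulative functions
z <- sort(c(x, y))
D <- max(abs(ecdf(x)(z) - ecdf(y)(z)))

# Effective sample size of Eq. (9.31), then Eq. (9.30)
n_eff  <- n1 * n2 / (n1 + n2)
lambda <- (sqrt(n_eff) + 0.12 + 0.11 / sqrt(n_eff)) * D
j <- 1:100
p_approx <- min(1, max(0, 2 * sum((-1)^(j - 1) * exp(-2 * j^2 * lambda^2))))

fit <- ks.test(x, y)     # the statistic should agree with D
```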
The pros and cons of the KS test are as follows: 


(a) Pros: the KS test is more powerful than the $\chi^2$ test because the latter does
not detect anomalies when the points are above or below the model curve in
a non-random way but still within the experimental error. This generally does
not happen with cumulative data. Furthermore, the distribution is independent
of the type of density considered.

(b) Cons: the test can only be applied to continuous data. Hence, it cannot be used
for histograms or discrete distributions.


This last point can be partially overcome with simulation techniques, which allow
us to find the reference distribution even for histogrammed data. In this case, one
has to be careful to keep the same number of bins and the same sample sizes as
the experimental data. As before, simulation allows us to use the good properties of
the cumulative distributions.

[Figure: two panels; abscissa "max diff."]
Fig. 9.10 Distribution of $D_n$ for n = 25. To the left: p.d.f. of $D_n$ obtained by calculating
differences using Eq. (9.30) (full curve) and from a Gaussian sample (histogram). To the right:
approximation from Eq. (9.30) (dashed curve) and cumulative function of the simulated data. The
result does not change if, for example, the uniform distribution is used in the simulation. The
abscissa of the intersection point between the horizontal line and the cumulative gives the 90%
quantile value, corresponding to p = 0.10

For instance, our MCKolmoHist routine finds the reference distribution from
two simulated Gaussian or uniform histograms with $n_1$ and $n_2$ events, respectively,
and the same number of bins.

The bin content of the normalized histograms is n(x)/n, where n(x) is the
number of simulated events inside the bin with mean value x and n is the total
number of events. Then, the difference between the cumulative histograms with K bins
is calculated as:

$$
T_{K n_1 n_2} = \sup_{1 \le k \le K} \left| \sum_{i=1}^{k} \frac{n_1(x_i)}{n_1} - \sum_{i=1}^{k} \frac{n_2(x_i)}{n_2} \right| . \tag{9.32}
$$

After repeating the cycle a large number of times, one gets the graphs of the density
and the cumulative function of $T_{K n_1 n_2}$. In Fig. 9.11, the simulated distribution of
10,000 differences is shown and compared with the same function as in Fig. 9.10.


[Figure: left panel "kolm. density function", right panel "Kolmogorov cumulative"; abscissa "max diff."]
Fig. 9.11 Distribution of 10,000 differences $T_{K n_1 n_2}$ from Eq. (9.32) for $n_1 = n_2 = 200$. To the
left: p.d.f. obtained by difference from the approximation (9.30) (full curve) and $H_0$ distribution
obtained from two homogeneous Gaussian samples (histogram). To the right: approximation (9.30)
(dashed curve) and cumulative function of the simulated data. The simulation clearly shows the
deviation of the histogram data from the Kolmogorov-Smirnov model


From this figure, we can clearly deduce that the p-values of a real experiment must
not be calculated with Eq. (9.30), but directly from the histograms simulated
under $H_0$. Our routine also allows the generation of histograms from the uniform
distribution. You can check that, with the same total number of events and channels,
the simulation results are very similar. However, this is a property that must be
verified on a case-by-case basis.
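
The procedure for histogrammed data can be sketched as below. This is not the code of our MCKolmoHist routine: the function name `sim_hist_ks`, the binning range and the observed value `t_obs` are assumptions made for illustration. Two Gaussian samples are drawn under $H_0$, binned on common bins, and the maximum distance between the cumulative histograms is recorded; the p-value of an observed distance is then the fraction of simulated distances that exceed it.

```r
# Reference distribution of the distance (9.32) between two binned samples under H0
sim_hist_ks <- function(nsim, n1, n2, K, lo = -4, hi = 4) {
  breaks <- seq(lo, hi, length.out = K + 1)
  one_draw <- function() {
    # all.inside = TRUE puts the rare values outside [lo, hi] in the first/last bin
    c1 <- tabulate(findInterval(rnorm(n1), breaks, all.inside = TRUE), nbins = K)
    c2 <- tabulate(findInterval(rnorm(n2), breaks, all.inside = TRUE), nbins = K)
    max(abs(cumsum(c1) / n1 - cumsum(c2) / n2))
  }
  replicate(nsim, one_draw())
}

set.seed(5)
Tsim  <- sim_hist_ks(nsim = 2000, n1 = 200, n2 = 200, K = 20)
t_obs <- 0.12                      # hypothetical distance observed in an experiment
p_mc  <- mean(Tsim >= t_obs)       # Monte Carlo p-value under H0
```

The number of bins and the sample sizes passed to the function must match those of the experimental histograms.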


Exercise 9.1 
Generate 20 variates from the Gaussian $N(\mu = 0.5, \sigma^2 = 4)$, and perform
the KS test in R with a standard Gaussian and with the correct one.


Answer The R code is: 


> x <- rnorm(20,mean=0.5,sd=2)
> ks.test(x,'pnorm')                  # one obtains p=0.011
> ks.test(x,'pnorm',mean=0.5,sd=2)    # one obtains p=0.82


from which we see that the first test gives, as expected, a small p- 
value, while the second, with the correct density, gives a two-tailed p-value 
corresponding to 0.41 for each tail. The test therefore tends to reject the first 
hypothesis and accepts the second one. 


9.7 Metropolis Algorithm 


The Metropolis algorithm is a sophisticated method to generate a sample from 
distributions that cannot be easily simulated with the techniques described in the 
previous chapter. It is best applied to functions that can be written as: 


$$
p(x) = \frac{h(x)}{Z} ,
$$


where x is a d-dimensional random vector. Due to the normalization conditions,
we have $Z = \sum_x h(x)$ in the discrete case, and $Z = \int h(x)\,dx$ in the continuous
one. The normalization constant Z must be known to obtain any quantity related
to p(x) (such as mean, variance and percentiles). However, in some cases its
calculation may be impossible in practice. In physics or chemistry, this happens,
for example, when systems consisting of a large number d of identical elementary
components, such as molecules in a gas or in a crystalline solid, are studied.
Typically, $d \sim 10^{23}$, a value that roughly represents the number of atoms contained
in a cm³ of matter.

In this case, x is a set of parameters (position, velocity, etc.) describing the
behaviour of all the elementary system components, and we suppose that each
of them can assume k different states. If g(x) describes a macroscopic system
parameter (such as temperature, pressure, magnetic moment, etc.), the calculation
of its mean, $\sum_x g(x)\,p(x)$, would require evaluating h(x) for each of the possible
$k^d$ system configurations, an effort that is out of reach with the currently available
computing resources.

In these cases, the Metropolis algorithm is very powerful since, to generate a
sample from p(x), it is not necessary to know Z or to evaluate h(x) for all values
of x. It is only necessary to generate a sequence of simulated states whose asymptotic
frequency distribution tends to p(x).

So, let us imagine a system that is initially in a state x, sampled from the
density p(x), and that can afterwards "migrate" to another state y according to an
arbitrary transition probability $t(x \to y)$. Systems in which these probabilities
depend only on the current state, and not on the earlier history, are called Markov chains and
are of fundamental importance in the study of many stochastic processes [RC99]. A
sufficient condition for the chain to converge to a state distributed as p(x) is that it
stabilizes in an equilibrium situation, where each transition occurs with probability
equal to the inverse one. One of the conditions for this to happen is given by the
so-called detailed balance equation:


$$
p(x)\, t(x \to y) = p(y)\, t(y \to x) , \tag{9.33}
$$


where the term to the left (right) indicates the probability that the system evolves
from x to y (from y to x).


The arbitrary function $t(x \to y)$ is usually written as:

$$
t(x \to y) = q(x, y)\, \alpha(x, y) . \tag{9.34}
$$


For each value x belonging to the spectrum of X, the auxiliary distribution q(x, y)
is required to be a probability distribution on the spectrum of X. In other words,
for any x, $q(x, y) \ge 0$ for each value of y and $\sum_y q(x, y) = 1$ in the discrete
case or $\int q(x, y)\,dy = 1$ in the continuous one. Another very useful requirement
to speed up the simulation is the possibility of quickly generating values from the
distribution $q(x, \cdot)$. The probability $\alpha(x, y)$ of accepting the value proposed by the
auxiliary distribution is instead defined by the Metropolis algorithm to guide the
evolution of the system towards increasingly probable states (where, e.g. the total
energy, or temperature, or pressure, is minimal). In this way, it can be demonstrated
that a stationary regime can always be reached where the macroscopic parameters
of the system do not vary with time, and, then, also Eq. (9.33) is satisfied.

The algorithm, proposed by Metropolis and co-workers in 1953 [MRR+53] and
generalized by Hastings in 1970 [Has70], consists of N steps; if $x^{(i)}$ is the state
value generated at step i, to obtain the next value, one proceeds as follows:


Algorithm 9.1 (Metropolis-Hastings Algorithm) 


(1) Generate a value y from the auxiliary distribution $q(x, \cdot)$, with $x = x^{(i)}$.
(2) Generate a value $\xi$ from the uniform distribution U(0, 1).
(3) Compute the probability of acceptance:

$$
\alpha(x, y) = \min\left\{ 1, \frac{h(y)\, q(y, x)}{h(x)\, q(x, y)} \right\} , \tag{9.35}
$$

where the ratio inside brackets is known as acceptance ratio;
(4) If $\xi < \alpha(x, y)$, then $x^{(i+1)} = y$; otherwise set $x^{(i+1)} = x$. If i < N return
to step 1.
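
The four steps can be sketched in R as follows. The function `metropolis_hastings` and its arguments are names introduced here for illustration; the target (an unnormalized Gamma density with mean 3) and the exponential proposal, which ignores the current state, are assumptions chosen so that the ratio q(y, x)/q(x, y) does not cancel.

```r
# h: unnormalized target; rq(x): draws y from q(x, .); dq(x, y): density q(x, y)
metropolis_hastings <- function(h, rq, dq, x0, N) {
  x   <- numeric(N)
  cur <- x0
  for (i in 1:N) {
    y     <- rq(cur)                                        # step 1
    xi    <- runif(1)                                       # step 2
    ratio <- (h(y) * dq(y, cur)) / (h(cur) * dq(cur, y))    # acceptance ratio
    if (xi < min(1, ratio)) cur <- y                        # steps 3 and 4
    x[i] <- cur
  }
  x
}

h  <- function(x) ifelse(x > 0, x^2 * exp(-x), 0)  # Z is never used
rq <- function(x) rexp(1, rate = 0.5)              # proposal independent of x
dq <- function(x, y) dexp(y, rate = 0.5)           # q(x, y) as a function of y

set.seed(6)
chain    <- metropolis_hastings(h, rq, dq, x0 = 1, N = 20000)
mean_est <- mean(chain[-(1:1000)])                 # first 1000 values discarded
```

With this target, `mean_est` should fluctuate around 3, the mean of the normalized density.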


It is easy to show that the values obtained with Eq. (9.35) follow a density that
satisfies the detailed balance condition. Indeed, since from Eqs. (9.33, 9.34) one
has:

$$
p(x)\, q(x, y)\, \alpha(x, y) = p(y)\, q(y, x)\, \alpha(y, x) ,
$$

taking into account Eq. (9.35), if $p(y)\,q(y, x)/[p(x)\,q(x, y)] < 1$, asymptotically
one has $\alpha(x, y) = p(y)\,q(y, x)/[p(x)\,q(x, y)]$ and $\alpha(y, x) = 1$; thus the identity:

$$
p(x)\, q(x, y)\, \frac{p(y)\, q(y, x)}{p(x)\, q(x, y)} = p(y)\, q(y, x)
$$

is obtained. Hence, given an initial value $x^{(0)}$, we obtain a sample $(x^{(1)}, \ldots, x^{(N)})$
by repeating N times the steps 1-4, without the need to know Z, as step 3 only
depends on the ratio p(y)/p(x) = h(y)/h(x).
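
The detailed balance condition can also be checked numerically on a small discrete example. In the sketch below, the four weights in `h` and the random proposal matrix `q` are arbitrary choices; the transition matrix is built from Eqs. (9.34) and (9.35), and the probability of remaining in the same state fills the diagonal.

```r
h <- c(1, 4, 2, 3)                      # unnormalized target on 4 states
p <- h / sum(h)

set.seed(7)
q <- matrix(runif(16, 0.1, 1), 4, 4)
q <- q / rowSums(q)                     # each row q(x, .) sums to 1

# alpha[x, y] = min{1, h(y) q(y, x) / (h(x) q(x, y))}
ratio <- (outer(1 / h, h) * t(q)) / q
alpha <- pmin(ratio, 1)

Tm <- q * alpha                         # t(x -> y) for y different from x
diag(Tm) <- 0
diag(Tm) <- 1 - rowSums(Tm)             # rejected proposals keep the state

flow <- p * Tm                          # flow[x, y] = p(x) t(x -> y)
balanced   <- isTRUE(all.equal(flow, t(flow)))          # Eq. (9.33)
stationary <- isTRUE(all.equal(drop(p %*% Tm), p))      # p is left unchanged
```

Only the weights `h` enter the acceptance probability, so the check never uses the normalization constant.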

But what kind of sample did we get? Obviously, it is not a random sample, since
the generation of $x^{(i+1)}$ depends on $x^{(i)}$. Furthermore, the initial value $x^{(0)}$,
which is usually arbitrarily picked, has little to do with $p(\cdot)$.

If we choose to collect the values of $x^{(i)}$ for i greater than some value such
that sample parameters (usually mean and variance) stabilize, we obtain a sample
well approximating the requested stationary distribution. The internal correlations
between the different sample elements do not generally prevent the determination
of the important parameters of the distribution. In fact, under simple assumptions,
one can show [RC99] the validity of the following:


Theorem 9.1 If q(x, y) > 0 for any x and for any y belonging to the X spectrum,
the property

$$
\lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} g(x^{(i)}) = \langle g(X) \rangle =
\begin{cases}
\sum_x g(x)\, p(x) , & \text{discrete case,} \\
\int g(x)\, p(x)\, dx , & \text{continuous case,}
\end{cases} \tag{9.36}
$$

holds for any initial point. Equation (9.36) remains valid even if the condition
q(x, y) > 0 is not satisfied for all the (x, y) pairs provided that, for any set A
of the spectrum with p(A) > 0, q(x, y) is such that A is reachable with positive
probability starting from any x.


In its simplest formulation, proposed in 1953 by [MRR+53], the algorithm is used
with q(x, y) = q(y, x). In this case $\alpha$ depends only on the ratio p(y)/p(x).

Often the proposed value is generated from the uniform distribution, within the support of p(x).
This is the method used by our test routine MCmetrop, applied to the Gaussian
case, which encodes the algorithm as follows:


for(k in 2:N) {
   i[k] = k
   # sampling in +- ks sigma around the mean
   # y = mu-ks*sigma + 2*ks*sigma*runif(1)
   y = runif(1,min=mu-ks*sigma,max=mu+ks*sigma)

   # Metropolis ratio between Gaussians
   alpha = exp( -0.5*( (y-mu)^2 - (x[k-1]-mu)^2 )/sigma^2 )
   u = runif(1)
   x[k] = x[k-1]                 # keep the old value unless y is accepted
   if(u<alpha) x[k]=y

   sumk = sumk+x[k]
   sumk2 = sumk2+x[k]^2

   plotk[k] = sumk/k                         # Metropolis for...
   plotk2[k] = sqrt(sumk2/k - plotk[k]^2)    # mean and sigma

   # continuous display of the mean
   plot(i,plotk,type='p',main='mean',lwd=2)
   grid()
}  # end of Metropolis cycle


The parameters mu, sigma and ks are given as input. 

It is very instructive to perform some tests with this routine; as an example, we
suggest solving Problem 9.7.

The Metropolis algorithm then provides us with the estimator:

$$
M = \frac{1}{N} \sum_{i=1}^{N} g(X^{(i)}) \tag{9.37}
$$

for $\langle g(X) \rangle$. The variance of M is not $\sum_i \mathrm{Var}[g(X^{(i)})]/N^2$, because the simulated
random variables are correlated. A method frequently used to circumvent the
dependency is to split the simulated sequence $(x^{(1)}, \ldots, x^{(N)})$ into k consecutive
blocks of b elements each (with b and k such that kb = N):

$$
(x^{(1)}, \ldots, x^{(b)}),\ (x^{(b+1)}, \ldots, x^{(2b)}),\ \ldots,\ (x^{((k-1)b+1)}, \ldots, x^{(kb)}) ,
$$

and to compute the sample mean of g for each block:

$$
\left( \langle g(x) \rangle^{(1)}, \ldots, \langle g(x) \rangle^{(k)} \right) .
$$


As the block size increases, non-consecutive blocks are increasingly distant (in
terms of iterations) and therefore less and less correlated. It can be shown that
also the correlation between consecutive blocks tends to cancel out as b increases,
approaching the uncorrelated situation. Hence, it is natural to use the
sample variance of the sequence of the block averages to estimate the error of
$\langle g(x) \rangle^{(j)}$. This error is associated with a sample mean of b terms; therefore, to get
the error associated with $m = \langle g(x) \rangle$, which is an average of N = kb terms, we
must divide the variance by k again, finally finding the batch means estimate (with a CL of
95.4%):

$$
\langle g(X) \rangle \in \langle g(x) \rangle \pm 2 \sqrt{ \frac{ \sum_{j=1}^{k} \left( \langle g(x) \rangle^{(j)} - \langle g(x) \rangle \right)^2 }{ k(k-1) } } . \tag{9.38}
$$
The previous equation relies on the validity of the Central Limit Theorem for the
distribution of M given by Eq. (9.37). This can be proved, under the hypotheses of
Theorem 9.1, for aperiodic Metropolis algorithms, i.e. when there is no partition of
the state space that is visited in the same sequence during the simulation. Once the
number of iterations N has been set, b must be chosen to calculate the batch means
estimate. This is a very complex problem; a commonly used practical rule is to set
$b = \sqrt{N}$ (see, for instance, [FJ10]).
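
A minimal sketch of the batch means estimate (9.38) is given below. The function `batch_means` is a name introduced here and is not our Batchmeans routine; the correlated test sequence is an autoregressive series generated with `arima.sim`, chosen only because its correlation is easy to control.

```r
# Batch means: mean of the sequence and its standard error from k blocks of size b
batch_means <- function(g, b) {
  k  <- floor(length(g) / b)
  g  <- g[1:(k * b)]                       # drop the incomplete last block, if any
  bm <- colMeans(matrix(g, nrow = b))      # one column per block
  m  <- mean(g)
  se <- sqrt(sum((bm - m)^2) / (k * (k - 1)))
  c(mean = m, se = se)
}

set.seed(9)
N  <- 10000
g  <- as.numeric(arima.sim(list(ar = 0.9), n = N))   # strongly correlated sequence
bm <- batch_means(g, b = floor(sqrt(N)))             # practical rule b = sqrt(N)
naive_se <- sd(g) / sqrt(N)                          # ignores the correlation

interval <- bm["mean"] + c(-2, 2) * bm["se"]         # 95.4% CL interval
```

For positively correlated sequences such as this one, the batch means error is expected to be larger than `naive_se`.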


Fig. 9.12 A 4 × 4 lattice of atoms of a crystalline solid with their spins (−1 dark
colour and +1 light colour) and a nearest-neighbour interaction, denoted by the
cross


9.8 Ising Model 


Let us now discuss an application of the Metropolis algorithm to a well-known
model, used by the German physicist E. Ising to explain some observed behaviours
in the magnetization of materials.⁴

The binary image formed by the 4 × 4 pixel⁵ matrix of Fig. 9.12 represents a two-dimensional
portion of a crystalline solid, where an atom, with its intrinsic magnetic
moment (due to spin), is located on every pixel: we associate spin value −1 to the
dark colour and spin value +1 to the light colour.

Depending on the temperature, the interaction existing between the nuclear spins 
at the microscopic level defines the behaviour of the material at the macroscopic 
level, determining its ferromagnetic or antiferromagnetic properties. In Fig. 9.12 the 
simplest model of microscopic interaction is shown, in which the atom in the middle 
of the cross only interacts with its horizontal and vertical nearest neighbours. 

Under this approximation, the Ising model defines the energy of a ferromagnetic 
material with n x n atoms as: 


$$
H(x) = -B \sum_{i,j:\, i \sim j} x_i x_j , \qquad B > 0 ,
$$

where $x = (x_1, \ldots, x_{n^2})$, $x_i$ is the spin of the i-th atom and the sum is over the
nearest neighbours ($i \sim j$).


4 This model has important applications also in fields completely different from physics, since 
it well describes the evolution of systems in which there are changes of state as a result of 
interactions between individuals. For example, it is used to study the social impact of new ideas 
and the dynamics of opinion in complex societies [KH96] and to predict the behaviour of financial 
markets [Voi03]. 


5 The term pixel comes from the contraction of the words picture element and indicates the smallest 
homogeneous unit constituting an artificial image. 


Without an external magnetic field, a probability for each configuration can be 
determined depending on the energy and on the temperature T as: 


$$
p(x) = \frac{\exp\{-H(x)/T\}}{\sum_x \exp\{-H(x)/T\}} = \frac{h(x)}{Z} .
$$


The formula shows that low-energy configurations, that is, those with neighbouring 
atoms with the same spin, have a higher probability to be reached. 

In 1944, L. Onsager [Ons44] developed the exact analytical treatment of the Ising
model in two dimensions, with the calculation of the expected number of atoms
with spin 1 at a given temperature, a quantity needed to determine the total magnetic
moment of the material. However, the more realistic three-dimensional model has
not been solved yet, and we need to resort to simulation, which we describe in the
two-dimensional case for simplicity.

In this situation, the admissible configurations of all spins are $2^{n^2}$, so it is not
possible, except for very small n, to calculate all p(x) values and carry out the
direct simulation as in Sect. 8.4. We therefore use the Metropolis algorithm with
an auxiliary distribution that, at each step, randomly selects an atom and proposes
its spin change. It is quite easy to realize that the formula for this distribution is as
follows:

$$
q(x, y) = \frac{1}{n^2}
$$


if x and y differ in the spin of an atom only, whereas q(x, y) = 0 in all the other
cases. To easily calculate the acceptance ratio, we can notice that the energy of the
system can be written as:

$$
H(x) = B\, (n^-(x) - n^+(x)) ,
$$

where $n^+$ and $n^-$ indicate the number of nearest neighbour atom pairs with
concordant and discordant sign, respectively. Then, since q(x, y) = q(y, x), the
acceptance ratio becomes simply:

$$
\exp\{ B [ (n^+(y) - n^+(x)) - (n^-(y) - n^-(x)) ] / T \} .
$$

Now, if the auxiliary distribution has chosen the i-th atom, the differences in the
exponent depend exclusively on the signs of the nearest neighbours of that atom. By
denoting with $n^+(x_i)$ and with $n^-(x_i)$ the number of nearest neighbours with the
same sign as $x_i$ and with opposite sign, respectively, and taking into account that
$n^-(x_i) = 4 - n^+(x_i)$, the acceptance ratio becomes:

$$
\exp\{ 2B [ n^+(y_i) - n^+(x_i) ] / T \} = \exp\{ 2B [ 4 - 2 n^+(x_i) ] / T \} .
$$


Therefore, once an initial configuration has been selected (e.g. by randomly
choosing the spin of each atom), the algorithm checks the spin orientations both
of the atom involved in the change and of its nearest neighbours. If $g(x^{(i)})$ is the
number of atoms with spin 1 at the i-th step of the algorithm, the next term $g(x^{(i+1)})$
used for the calculation of the mean (9.36) is unchanged, if the chosen atom does not
change spin or, otherwise, is easily obtained by adding 1 or −1. From an algorithmic
point of view, the index of the atom to be changed is obtained by generating a
uniform variate $\xi_1$ in (0, 1) and selecting the smallest integer i that exceeds $\xi_1 n^2$.

Given i, if a second uniform variate $\xi_2$ is less than the acceptance probability

$$
\alpha(x, y) = \min\left\{ 1, \exp\{ 2B [ 4 - 2 n^+(x_i) ] / T \} \right\}
$$

a spin change occurs. This algorithm is implemented in our code MCising.
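
The single spin-flip scheme can be sketched as follows. This is not the code of MCising: the function name `ising_metropolis` is introduced here, and periodic boundary conditions are assumed so that every atom has exactly four nearest neighbours, a detail that the text above does not specify.

```r
# Metropolis simulation of the n x n Ising lattice; returns the number of
# spin +1 atoms after each iteration. Periodic boundaries are an assumption.
ising_metropolis <- function(n, B, Temp, N) {
  s  <- matrix(-1, n, n)                  # start with all spins equal to -1
  g  <- numeric(N)
  up <- 0                                 # current number of spin +1 atoms
  for (it in 1:N) {
    k  <- sample.int(n * n, 1)            # atom proposed for the spin change
    ir <- (k - 1) %% n + 1
    ic <- (k - 1) %/% n + 1
    nb <- c(s[(ir - 2) %% n + 1, ic], s[ir %% n + 1, ic],
            s[ir, (ic - 2) %% n + 1], s[ir, ic %% n + 1])
    n_plus <- sum(nb == s[ir, ic])        # neighbours with the same sign
    alpha  <- min(1, exp(2 * B * (4 - 2 * n_plus) / Temp))
    if (runif(1) < alpha) {
      s[ir, ic] <- -s[ir, ic]
      up <- up + s[ir, ic]                # +1 if the new spin is +1, else -1
    }
    g[it] <- up
  }
  g
}

set.seed(10)
g <- ising_metropolis(n = 20, B = 0.3, Temp = 1, N = 100000)
g_mean <- mean(g[-(1:10000)])             # initial iterations discarded
```

Plotting `g` and its running mean against the iteration number shows how the sequence leaves the initial configuration.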

We performed a simulation of the Ising model with N = 100000 iterations,
B = 0.3 and T = 1, starting from the configuration where all atoms have spin −1 in a
lattice n × n with n = 20. Our goal was to determine the expected number of atoms
with spin 1, so g(x) will give this value for the configuration x. In Fig. 9.13 the
dashed line shows the time evolution of the sequence $\{g(x^{(i)})\}$. The sample running
mean $\langle g(x) \rangle$ calculated as a function of the number of iterations is displayed with a
continuous line. Some main features can be easily noted:


(1) After a few thousand iterations, the sequence $\{g(x^{(i)})\}$ stabilizes and begins to
oscillate around its presumed expected value.

(2) The plot of the sample mean instead tends to converge to a constant value,
which, according to Theorem 9.1, is $\langle g(X) \rangle$.

(3) Since the simulation starts from very low-probability states that are quickly
abandoned by the system, a distortion in the estimate of the mean is present,
since the configuration with all spin −1 atoms will never be spontaneously
reached during any finite length simulation. This situation gives the initial states
a greater weight than they should have. We then have to discard an initial number of
iterations (e.g. the first 10,000) to reach a high probability zone and start to
accumulate data from this point to compute $\langle g(x) \rangle$.

(4) The amount of data to be collected can be evaluated using the plot of the
sample mean. The simulation can be stopped when the mean oscillations have
an amplitude lower than a certain threshold (which is subjective and depends
on the desired precision).

(5) The group size b for the batch means method must be increased until the
autocorrelation of the series $\{\langle g(x) \rangle^{(j)}\}_{j \ge 1}$ becomes negligible. As suggested at
the end of the previous section, we set b equal to the square root of the sample
size (for the details see again [FJ10]). This in turn may require increasing the
number of iterations in order not to have too small a number of groups. Our
example has precisely these characteristics.

[Figure: abscissa "iterations"]
Fig. 9.13 Result of the Metropolis simulation for the Ising model: the number of
atoms with spin 1 (dashed line) and the sample mean of this parameter (solid line) are
shown as a function of the number of iterations

Considering Fig. 9.13, if the first 10,000 iterations are discarded and the other
90,000 are retained, the interval estimate (9.38), with b = 300 (i.e. 300 groups),
is:


200.9 ± 2 × 1.1 = (198.7, 203.1) ,


which was obtained by applying our Batchmeans routine to the sample sequence
of the number of spin 1 atoms.

We conclude with an important warning. We suggest that you perform, using our
MCising code, additional simulations with temperatures T lower than 1. You will
find that, when the temperature gives B/T > 0.44, the plot of g(x) will move
fairly quickly to one of the two modes of its distribution (i.e. g(x) = 0 or g(x) =
400), without being able to evolve from one to the other. In this type of situation,
although Theorem 9.1 still holds, it is practically impossible to carry out the number
of iterations necessary to visit the state space regions of greatest probability; hence
the sample mean of the plot is by no means a good estimate of the true one.


9.9 Definite Integral Calculation 


The numerical computation of the value of a definite integral, one of the best known 
and most widespread applications of the MC methods, is also a typical example of 
the use of simulation techniques for problems that, at first sight, would seem not to 
allow a statistical approach. 

As we will see shortly, the MC methods are convenient for solving multidimensional
integrals, where the other numerical methods present some relevant
application problems. In the following, however, to compact and simplify the
notation, we will consider only functions of a single variable, bearing in mind that
the whole discussion can be very easily extended to the case of functions of many
variables.

There are two different fundamental approaches that can be used to calculate the 
definite integral: 


$$
I = \int_a^b f(x)\, dx \tag{9.39}
$$


using random numbers. 

The first method, called hit or miss, is based on the geometric interpretation of
the value of a definite integral as a measure of the area under f(x) and within the
integration interval [a, b] (dashed area of Fig. 9.14). By exploiting, from a different
point of view, the same properties applied in the rejection method, we can in fact
determine I by multiplying the area of a rectangle enclosing f(x) by the probability p
that any point P inside the rectangle is also inside the area under f(x) (hatched area
of Fig. 9.14). Recalling relation (8.41), we have:


$$
p = \frac{\text{hatched area}}{\text{area of the rectangle } ABCD} = \frac{I}{h \cdot (b - a)} , \tag{9.40}
$$

so that:

$$
I = pA , \quad \text{with } A = h \cdot (b - a) . \tag{9.41}
$$

Fig. 9.14 Graphic representation of the hit or miss method: the integral
(9.39) is evaluated using the proportion of random points uniformly distributed inside
the rectangle ABCD that also "hit" the dashed zone. The rectangle has base (b − a)
and height h

If we randomly generate N points $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$ uniformly
distributed inside the rectangle ABCD and count the number $N_S$ of "successes",
i.e. the number of times in which $y_i < f(x_i)$, an approximate evaluation of p is
obtained using the ratio $N_S/N$ (see Fig. 9.14). The estimate of the integral (9.39)
then becomes:


$$
I = pA \simeq \frac{N_S}{N}\, A \equiv I_N^{HM} . \tag{9.42}
$$


This value should be considered as the realization of a new "Hit or Miss" statistical
variable $I_N^{HM}$. To derive its error, it is sufficient to observe that $N_S$ follows a
binomial distribution with mean Np = NI/A and variance Np(1 − p); we therefore
get:

$$
\mathrm{Var}[I_N^{HM}] = \frac{A^2}{N^2}\, \mathrm{Var}[N_S] = \frac{A^2}{N}\, p(1 - p) \simeq \frac{I_N^{HM}}{N} \left( A - I_N^{HM} \right) . \tag{9.43}
$$
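
The hit or miss estimate (9.42) and its error (9.43) can be sketched in a few lines. The function `hit_or_miss` is a name introduced here and is not our MCinteg routine; as a test function we take $\sqrt{\sin x}$ on $[0, \pi/2]$ with h = 1, and the code assumes $0 \le f(x) \le h$ on the whole interval.

```r
# Hit or miss integration of f on [a, b], with 0 <= f(x) <= h assumed
hit_or_miss <- function(f, a, b, h, N) {
  x  <- runif(N, a, b)
  y  <- runif(N, 0, h)
  Ns <- sum(y < f(x))                    # number of "successes"
  A  <- h * (b - a)                      # area of the enclosing rectangle
  I_hat <- A * Ns / N                    # Eq. (9.42)
  c(estimate = I_hat,
    error    = sqrt(I_hat * (A - I_hat) / N))   # square root of Eq. (9.43)
}

set.seed(8)
res_hm <- hit_or_miss(function(x) sqrt(sin(x)), a = 0, b = pi / 2, h = 1, N = 1000)
```

The returned vector contains the estimate of the integral and its standard error for the chosen N.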


The second MC integration method, known as crude Monte Carlo, considers 
instead x as a uniformly distributed random variable within the integration interval 
[a, b]. Recalling Definition (2.68) of the expected value of a function of random 
variable, we can write the identity: 


$$
I = (b - a) \int_a^b f(x)\, \frac{dx}{b - a} = \langle f(x) \rangle\, (b - a) , \tag{9.44}
$$


with $\langle f(x) \rangle$ equal to the mean of the values assumed by the integrand function on
[a, b].

One of the ways to roughly evaluate $\langle f(x) \rangle$ is to calculate the average of N values
$f(x_1), f(x_2), \ldots, f(x_N)$, with $x_1, x_2, \ldots, x_N$ randomly and uniformly sampled
within [a, b]:

$$
I \simeq \frac{(b - a)}{N} \sum_{i=1}^{N} f(x_i) \equiv I_N^{MC} . \tag{9.45}
$$


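The variance of the random variable $I_N^{MC}$ is easily obtained from the properties of
the variance operator defined in Sect. 2.9:

$$
\mathrm{Var}[I_N^{MC}] = \sigma^2_{I_N} = \frac{(b - a)^2}{N^2}\, \mathrm{Var}\left[ \sum_{i=1}^{N} f(X_i) \right] = \frac{(b - a)^2}{N}\, \mathrm{Var}[f(X_i)] . \tag{9.46}
$$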


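Since $f(X_i)$ can be considered as a function of the random variable $X_i$, taking into
account Eqs. (9.44) and (2.49), one gets:

$$
\mathrm{Var}[I_N^{MC}] = \frac{1}{N} \left[ (b - a) \int_a^b f^2(x)\, dx - \left( \int_a^b f(x)\, dx \right)^2 \right]
= \frac{(b - a)^2}{N} \left[ \langle f^2(x) \rangle - \langle f(x) \rangle^2 \right] . \tag{9.47}
$$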


The term within square brackets is a measure of how much f(x) differs from its
mean value in the integration region, so $\mathrm{Var}[I_N^{MC}]$ strongly depends on the codomain
of f(x).
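
A minimal sketch of the crude Monte Carlo estimate (9.45) follows. The function `crude_mc` is a name introduced here and is not our MCinteg routine; the variance of Eq. (9.46) is estimated from the sample standard deviation of the values $f(x_i)$, and the test function $\sqrt{\sin x}$ on $[0, \pi/2]$ is chosen for illustration.

```r
# Crude Monte Carlo integration of f on [a, b]
crude_mc <- function(f, a, b, N) {
  fx <- f(runif(N, a, b))                       # f at N uniform points
  c(estimate = (b - a) * mean(fx),              # Eq. (9.45)
    error    = (b - a) * sd(fx) / sqrt(N))      # from Eq. (9.46), sample variance
}

set.seed(11)
res_mc <- crude_mc(function(x) sqrt(sin(x)), a = 0, b = pi / 2, N = 1000)
```

Unlike the hit or miss sketch, no bound h on the integrand is needed here.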


Exercise 9.2 
Compute, using MC, the integral: 


$$
I = \int_0^{\pi/2} \sqrt{\sin x}\; dx . \tag{9.48}
$$


Answer Despite the apparent simplicity, this integral is not analytically 
solvable. The exact solution, obtainable only numerically, is: 


$$
I = \int_0^{\pi/2} \sqrt{\sin x}\; dx = 1.19814\ldots \tag{9.49}
$$


We apply the two MC integration methods previously described using our 
MCinteg routine. 


(a) Crude MC method of Eqs. (9.45, 9.47) 
By generating 1000 random points, we obtained J = 1.199 + 1.2- 107, 
as given in the first row of Table 9.3. As we have already noted previously, 
if you want to obtain a very precise solution (e.g. within a few per 
thousand), you need to generate a large number of points, given the low 
convergence speed of the MC result to /. 

(b) Hit or miss method of Eqs. (9.42, 9.43) 
In this case we have: h = 1, (b − a) = π/2, and, with 1000 random
points, we obtained $I = 1.202 \pm 2.1 \cdot 10^{-2}$, a result with an error about
two times the previous one.


In the MC framework, one defines as efficiency η of an algorithm the quantity:

$$ \eta = 1/(t_N\, \mathrm{Var}[S_N]), \tag{9.50} $$

in which $S_N$ is the integral value given by the algorithm and $t_N$ the time
needed for its computation. In our example, since $t_N$ of the two procedures is
about the same, with the hit or miss method, we have decreased the efficiency
of the numerical integration programme by a factor of about three, since
$(2.1/1.2)^2 \simeq 3$.

This difference obviously depends on the particular integral considered, 
but it is easy to demonstrate (see, e.g. [Jam80]) that the hit or miss method 
always gives a less precise result than the crude MC. This fact can be intuitively
understood by observing that, in the first method, for each generated
point $x_i$, the value 1 is added with probability $f(x_i)$, instead of adding the
corresponding value $f(x_i)$. In this way, an estimate is used instead of the
exact value and, therefore, an additional error is introduced.
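The comparison of Exercise 9.2 can be reproduced with the following sketch, which is not the MCinteg routine of the text. The variable names and the seed are assumptions; the hit or miss error is computed here from the binomial variance of the fraction of hits, which is assumed to correspond to Eqs. (9.42, 9.43).

```r
f <- function(x) sqrt(sin(x))      # integrand of Eq. (9.48)
a <- 0; b <- pi / 2; h <- 1        # integration limits and upper bound h of f
n <- 1000                          # number of random points, as in the exercise

set.seed(2022)                     # arbitrary seed

# (a) crude MC: average of f over uniform points
fx <- f(runif(n, a, b))
i_crude <- (b - a) * mean(fx)
e_crude <- (b - a) * sqrt((mean(fx^2) - mean(fx)^2) / n)

# (b) hit or miss: (x, y) uniform in [a, b] x [0, h] is a hit when y < f(x)
x <- runif(n, a, b)
y <- runif(n, 0, h)
p <- mean(y < f(x))                # fraction of hits
i_hm <- h * (b - a) * p
e_hm <- h * (b - a) * sqrt(p * (1 - p) / n)

cat(sprintf("crude MC   : %.3f +/- %.3f\n", i_crude, e_crude))
cat(sprintf("hit or miss: %.3f +/- %.3f\n", i_hm, e_hm))
cat(sprintf("variance ratio (hit or miss / crude): %.1f\n", (e_hm / e_crude)^2))
```

With equal computing times, the printed variance ratio is the factor by which the efficiency (9.50) decreases when passing from the crude to the hit or miss method.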




Table 9.3 Comparison between the estimates of the integral (9.48) obtained by the generation
of 1000 random points with our MCinteg routine. The different algorithms are explained in the
text

Algorithm                 I_N        σ_{I_N}

Crude MC                  1.199      1.2 · 10^-2
“Hit or miss”             1.202      2.1 · 10^-2
Importance sampling       1.1987     2.5 · 10^-3
Stratified sampling
  (a) Proportional        1.1981     1.0 · 10^-3
  (b) Optimal             1.19785    6.8 · 10^-4


9.10 Importance Sampling 


The method of Eqs. (9.45, 9.47) is called “crude” since, for a given number of
samplings N, a better precision can be achieved with more sophisticated techniques.
This result can be obtained, according to Eq. (8.6), by reducing the variance $\sigma_T^2$ of
the distribution of the simulated variable $T_i$.

This operation becomes relatively easy in the integration as $\sigma_T$ coincides with
the standard deviation of the integrand function (see Eq. 9.46), whose manipulation
is generally quite simple.

For example, while discussing Eq. (9.47), we noted that the greater the variation 
of f(x) in the integration region, the greater will be the error on the result that, 
conversely, becomes more precise when the generated values of f(x) are not too 
dissimilar to each other. Then, the MC estimate of I can be made more precise
when a positive integrable function g(x) is found such that $g(x) \simeq f(x)$, which
allows us to rewrite I as:

$$ I = \int_a^b f(x)\, dx = \int_a^b \frac{f(x)}{g(x)}\, g(x)\, dx = \int_a^b \frac{f(x)}{g(x)}\, dG(x), \tag{9.51} $$

with:

$$ G(x) = \int g(x)\, dx. \tag{9.52} $$

If we assume that g(x) is also normalized within [a, b], it is easy to conclude
that Eq. (9.51) represents nothing more than the mean of the random function
f(X)/g(X), where X is a random variable with density g(x), according to
Eq. (2.68). We then can write I as:

$$ I = \left\langle \frac{f(X)}{g(X)} \right\rangle. \tag{9.53} $$


If we consider a random sample $X_1, X_2, \ldots, X_N$ from the density g(x) we obtain:

$$ I \simeq I_N^{IS} = \frac{1}{N} \sum_{i=1}^{N} \frac{f(X_i)}{g(X_i)}, \qquad X \sim g(x). \tag{9.54} $$


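Instead of uniformly generating x to integrate f(x), a random variable distributed
as g(x) is generated to integrate f(x)/g(x), thus giving more weight to the more
“important” parts of f(x) (hence the name of importance sampling given to this
algorithm). The final variance now depends on the ratio f(x)/g(x), and, recalling
Eq. (9.47), it is given by:

$$ \mathrm{Var}[I_N^{IS}] = \frac{1}{N} \left[ \int_a^b \frac{f^2(x)}{g^2(x)}\, dG(x) - I^2 \right] = \frac{1}{N} \left[ \int_a^b \frac{f^2(x)}{g(x)}\, dx - I^2 \right], \tag{9.55} $$

or, in an approximated way, as in Eq. (9.47):

$$ \mathrm{Var}[I_N^{IS}] \simeq \frac{1}{N} \left[ \frac{1}{N} \sum_{i=1}^{N} \frac{f^2(x_i)}{g^2(x_i)} - \left( \frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{g(x_i)} \right)^2 \right]. \tag{9.56} $$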


We recall that in both previous equations, x is sampled from g(x). 

From Eq. (9.55) we immediately notice how the variance becomes zero if
g(x) = f(x)/I. Unfortunately, in real problems, we will never be able to apply
this equation since it is necessary to know I, which is exactly the solution of the
problem. However, Eq. (9.55) quantitatively demonstrates that g(x) must be chosen
as similar as possible to f(x). This choice ensures that the ratio between
the two functions varies inside a limited range within the integration region, thus 
maximizing the gain in precision. 
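A minimal sketch of Eqs. (9.54) and (9.56) follows, applied to the integral (9.48). As an assumption made for illustration, g(x) is taken proportional to the square root of x on [0, π/2], the same choice discussed later in Exercise 9.3, and it is sampled by inverting its cumulative function. The variable names and the seed are hypothetical, and the code is not the MCinteg routine of the text.

```r
f <- function(x) sqrt(sin(x))                    # integrand of Eq. (9.48)
b <- pi / 2                                      # upper limit; the lower limit is 0
g <- function(x) 1.5 * (2 / pi)^1.5 * sqrt(x)    # density normalized on [0, pi/2]

set.seed(1)                                      # arbitrary seed
n  <- 1000
xi <- runif(n)                 # uniform variates in (0, 1)
x  <- b * xi^(2 / 3)           # inverse of the cumulative G(x) = (2 x / pi)^(3/2)
w  <- f(x) / g(x)              # ratio f/g evaluated on points distributed as g

i_is <- mean(w)                                  # Eq. (9.54)
e_is <- sqrt((mean(w^2) - mean(w)^2) / n)        # square root of Eq. (9.56)
cat(sprintf("importance sampling: %.4f +/- %.4f\n", i_is, e_is))
```

Since the ratio f/g varies only between about 1.05 and 1.31 over the whole interval, the spread of w is much smaller than the spread of f under uniform sampling, which is where the gain in precision comes from.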


9.11 Stratified Sampling 


The idea behind the stratified sampling, a well-known technique used in statistics, 
is similar to the one we have just described: a greater number of points are 
concentrated in those areas that are more important for the calculation. The 
difference is now that, instead of changing the integrand, the integration region 
is divided into different subintervals; then points are sampled uniformly but with 
different densities, depending on the particular considered interval. 

Thus, the interval [a, b] is divided into k segments defined by the points
$a = a_0 < a_1 < a_2 < \ldots < a_k = b$. The width and the sample size of the j-th
subinterval are denoted by $\Delta_j = (a_j - a_{j-1})$ and $N_j$, respectively. For the
well-known integral properties, we can also write:

$$ I = \int_a^b f(x)\, dx = \sum_{j=1}^{k} \int_{a_{j-1}}^{a_j} f(x)\, dx = \sum_{j=1}^{k} I_j. \tag{9.57} $$


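In stratified sampling, each integral $I_j$ which appears in the right term of the
previous equation is approximated with the crude method:

$$ I_j = \int_{a_{j-1}}^{a_j} f(x)\, dx \simeq \frac{\Delta_j}{N_j} \sum_{i=1}^{N_j} f(a_{j-1} + \Delta_j \xi_i) = \frac{\Delta_j}{N_j} \sum_{i=1}^{N_j} f(x_{ij}), \tag{9.58} $$

where $x_{ij} = a_{j-1} + \Delta_j \xi_i$ is the i-th point sampled from the uniform distribution
within the j-th subinterval ($\xi_i$ being uniform in [0, 1]). The I estimate then becomes:

$$ I = \sum_{j=1}^{k} \Delta_j \langle f_j(X) \rangle \simeq \sum_{j=1}^{k} \frac{\Delta_j}{N_j} \sum_{i=1}^{N_j} f(x_{ij}) \equiv I_N^{SS}, \tag{9.59} $$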


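where $\langle f_j(X) \rangle$ is the mean value of f(X) in the j-th subinterval. The global
variance, which is nothing more than the sum of the variances of the single $I_j$ taken
separately, is given by Eq. (9.47):

$$ \mathrm{Var}[I_N^{SS}] = \sum_{j=1}^{k} \frac{\Delta_j^2}{N_j}\, \sigma_j^2 = \sum_{j=1}^{k} \frac{1}{N_j} \left[ \Delta_j \int_{a_{j-1}}^{a_j} f^2(x)\, dx - \left( \int_{a_{j-1}}^{a_j} f(x)\, dx \right)^2 \right] \tag{9.60} $$

($\sigma_j^2$ is the variance of f(x) in the j-th subinterval), or, in an approximated way:

$$ \mathrm{Var}[I_N^{SS}] \simeq \sum_{j=1}^{k} \frac{\Delta_j^2}{N_j} \left[ \frac{1}{N_j} \sum_{i=1}^{N_j} f^2(x_{ij}) - \left( \frac{1}{N_j} \sum_{i=1}^{N_j} f(x_{ij}) \right)^2 \right]. \tag{9.61} $$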


The error on the final result now depends not only on the behaviour of f(x) but
also on the way in which the integration domain is divided and on how the points
are distributed within each subinterval. Once the subintervals $\Delta_j$ are fixed, with a
straightforward but rather lengthy procedure, one can derive from Eq. (9.60) that the best
choice for $N_j$ is given by the rule [Coc77]:

$$ N_j = N\, \frac{\Delta_j \sigma_j}{\sum_{j=1}^{k} \Delta_j \sigma_j}. \tag{9.62} $$

As intuitively expected, this equation prescribes concentrating the random generation
in the largest subintervals and in those with the greatest variations. Even
in this case, however, this result is not directly applicable, as the $\sigma_j$ values are
obviously unknown a priori. To use the previous formula, a short preliminary test is
usually performed to obtain a fairly correct estimate of $\sigma_j$, from which to derive an
appropriate value for $N_j$. In doing this, a reasonable compromise must be made
between the required calculation time and the increase in precision that can be
obtained on the final result.

When this procedure is too long or complicated, it is possible to demonstrate 
(see again [Coc77]) that the best way to proceed is to generate a number of points 
proportional to the length of each subinterval: 

$$ N_j = N\, \frac{\Delta_j}{b-a}. \tag{9.63} $$


This property can be intuitively understood if we observe that with the stratified 
proportional sampling, the uniformity of generation of random points is improved 
compared to the crude method, thus reducing the statistical fluctuations due to a 
relevant increase of their density in specific zones of the integration interval. 

The subinterval lengths can instead be optimized only when the numbers of points $N_j$ are
chosen on the basis of Eq. (9.62), and the integration domain is simultaneously
divided into a large number of subintervals (see again [Coc77]), conditions that are
not always satisfied in practice. Otherwise, there are no strict prescriptions; usually,
to simplify the programmes, the subintervals $\Delta_j$ are all taken with the same length:

$$ \Delta_j = \frac{(b-a)}{k}, \qquad \forall j = 1, \ldots, k. \tag{9.64} $$


Exercise 9.3 
Calculate integral (9.48) with the importance and stratified sampling tech- 
niques. 


Answer (a) Importance sampling 
Since, in a neighbourhood of zero, the function sin x can be expanded
as:

$$ \sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \ldots, \tag{9.65} $$

we can try to approximate the integrand function $\sqrt{\sin x}$ with $h(x) = \sqrt{x}$,
as shown in Fig. 9.15. Hence, we choose this form for g(x) that, after
normalization, becomes:

$$ g(x) = \frac{3}{\pi} \sqrt{\frac{2}{\pi}}\, \sqrt{x}, \tag{9.66} $$

giving the cumulative function:

$$ G(x) = \int_0^x g(x)\, dx = \left( \frac{2x}{\pi} \right)^{3/2}. \tag{9.67} $$

Equation (8.12) results in:

$$ x = \frac{\pi}{2}\, \xi^{2/3}. \tag{9.68} $$
The x variates generated with this formula are used to calculate the 
integral according to Eq. (9.54). This algorithm is performed by our 
MCinteg routine. 
The result with N = 1000, reported in the third row of Table 9.3, clearly
shows an improvement in precision of about a factor of five
compared to the crude method. Since the computing time is roughly
the same for the two algorithms, the gain in efficiency is more than
a factor of 20. However, to successfully apply this method, g(x) must be easy
to sample from (recall that its cumulative must be known), even when 
the behaviour of f(x) is complicated. Moreover, in the multidimensional 
case, to minimize the computation time, it is definitely preferable that
$G(x_1, x_2, \ldots, x_n)$ is a separable function:

$$ G(x_1, x_2, \ldots, x_n) = g(x_1) \cdot g(x_2) \cdot \ldots \cdot g(x_n). $$

The classes of functions satisfying all these conditions are few,
essentially: trigonometric, exponential, low-order polynomials and some
combinations of them.

(b) Stratified sampling

We apply this algorithm using the uniform stratified sampling technique.
We then consider equal subintervals and generate the same number of
points in each of them. By dividing the integration domain into 20
subintervals and generating 1000 points, we obtained the value reported
in the fourth row of Table 9.3. Compared to the crude method, we note
an improvement in the standard deviation of a factor of ≃ 12, which,
even taking into account an additional calculation time of about 30%,
corresponds to an increase in efficiency of more than two orders of
magnitude.

With 10,000 points and 100 layers, we obtained a very accurate result: 


$$ I = 1.198154 \pm 0.000023. $$


With the same total number of points and subintervals, the precision 
can be further increased if the number of points in each subinterval 
is determined with the optimal method of Eq. (9.62), as shown in
the last row of Table 9.3. This algorithm is present in our MCinteg 
and MCintopt routines; the last one applies the optimized stratified 
sampling to an input function. 
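The two stratified strategies can be sketched as follows. This is only an illustration under stated assumptions and not the MCinteg or MCintopt routines: the helper name stratified, the seed, the pilot run of 10 points per subinterval and the lower bound of 2 points per subinterval are choices made for this example.

```r
f <- function(x) sqrt(sin(x))      # integrand of Eq. (9.48)

# Stratified estimate over k = length(counts) equal subintervals of [a, b];
# counts[j] is the number of points N_j generated in the j-th subinterval.
stratified <- function(counts, a = 0, b = pi / 2) {
  k <- length(counts)
  width <- (b - a) / k                                   # Eq. (9.64)
  means <- vars <- numeric(k)
  for (j in seq_len(k)) {
    fx <- f(a + (j - 1 + runif(counts[j])) * width)      # uniform in the j-th subinterval
    means[j] <- mean(fx)
    vars[j]  <- max(mean(fx^2) - mean(fx)^2, 0)
  }
  list(estimate = sum(width * means),                    # Eq. (9.59)
       error    = sqrt(sum(width^2 * vars / counts)),    # square root of Eq. (9.61)
       sigma    = sqrt(vars))
}

set.seed(7)                        # arbitrary seed
k <- 20; n <- 1000

# (a) proportional allocation, Eq. (9.63): equal widths give N_j = N / k
prop <- stratified(rep(n / k, k))

# (b) optimal allocation, Eq. (9.62): sigma_j estimated with a short pilot run
pilot  <- stratified(rep(10, k))
counts <- pmax(2, round(n * pilot$sigma / sum(pilot$sigma)))
opt    <- stratified(counts)

cat(sprintf("proportional: %.4f +/- %.5f\n", prop$estimate, prop$error))
cat(sprintf("optimal     : %.4f +/- %.5f\n", opt$estimate, opt$error))
```

Because all subintervals have the same width here, rule (9.62) reduces to taking N_j proportional to the estimated sigma_j; the points of the pilot run are an extra cost that is not included in N.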


In addition to the methods presented here, there are several other variance 
reduction techniques; those interested can consult [Kah56, KW86, Rub81, Rip86]. 


Fig. 9.15 Comparison between the functions $f(x) = \sqrt{\sin x}$ and $h(x) = \sqrt{x}$ in the interval [0, π/2]


9.12 Multidimensional Integrals 


Up to now, we have shown how it is possible to improve, even significantly, the
precision of the MC estimate of the value of a definite integral by using variance
reduction techniques. But, despite this progress, MC methods, for one-dimensional
functions, are always less efficient than the standard numerical approximation
procedures, which converge as $N^{-k}$ with $k \geq 2$.

However, this situation changes when we consider the multidimensional case,
as the $N^{-1/2}$ convergence of the MC error is independent of the dimensionality d of the
integral, while the precision of the other numerical techniques varies as $N^{-k/d}$
with $k \geq 2$ (a process requiring N points in one dimension will then need, to get
the same precision, $N^2$ points in two dimensions, $N^3$ points in three dimensions
and so on). Furthermore, all the other methods assume a smooth polynomial
behaviour of the integrand function, and, therefore, their use is questionable in case
of discontinuities. For all these reasons, MC methods start to be competitive for five-
dimensional integrals, becoming the only ones actually applicable in ten or more
dimensions.

In case you have to compute for the first time the multidimensional integral: 


$$ I = \int_{\Omega} f(x_1, \ldots, x_d)\, dx_1 \ldots dx_d \tag{9.69} $$

with MC methods, we suggest that you try some existing software. There are
several reliable codes available, which exploit the variance reduction
techniques described before (see, e.g. [PFTW92]). In these codes, to make the 
algorithms simpler and more reliable, the integration is carried out on a domain 
having independent integration limits (a hypercube or a hyper-rectangle) obtained 
by performing an appropriate change of coordinates. If this transformation turns out 
to be too complex, it is often preferable to geometrically consider the integral (9.69) 
as the volume of a solid W in the space with d + 1 dimensions $(x_1, \ldots, x_d, y)$ and
apply the hit or miss method introduced in Sect. 9.9. 
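For a domain with independent integration limits, the crude method extends to d dimensions simply by replacing the length (b − a) with the volume of the hyper-rectangle. The sketch below is an illustration only: the helper name mc_hyperrectangle, the seed and the test integrand exp(−Σ x_i²) on the unit hypercube with d = 6 (whose exact value factorizes into one-dimensional integrals) are assumptions made for the example.

```r
# Crude MC estimate of the integral of f over the hyper-rectangle
# [lower[1], upper[1]] x ... x [lower[d], upper[d]] (hypothetical helper).
mc_hyperrectangle <- function(f, lower, upper, n) {
  d <- length(lower)
  u <- matrix(runif(n * d), nrow = n, ncol = d)        # n points in the unit hypercube
  x <- sweep(sweep(u, 2, upper - lower, "*"), 2, lower, "+")   # rescale each coordinate
  fx <- apply(x, 1, f)                                 # integrand at each point
  volume <- prod(upper - lower)
  c(estimate = volume * mean(fx),
    error    = volume * sqrt((mean(fx^2) - mean(fx)^2) / n))
}

gauss_bump <- function(x) exp(-sum(x^2))               # test integrand

set.seed(99)                                           # arbitrary seed
d <- 6
res <- mc_hyperrectangle(gauss_bump, rep(0, d), rep(1, d), 20000)

# exact value: (sqrt(pi)/2 * erf(1))^d, with erf(1) = 2 * pnorm(sqrt(2)) - 1
exact <- (sqrt(pi) / 2 * (2 * pnorm(sqrt(2)) - 1))^d
print(res)
print(exact)
```

The error formula is the same as in one dimension, which illustrates why the precision for a given N does not degrade with d.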


9.13 Problems 


9.1 Often the maximum deviation is used in quality controls: a characteristic
parameter X, considered as a normal random variable, is chosen, and the difference
between the maximum and minimum values of X found in a control lot of n
elements is examined. Knowing that the average production deviation of X is
σ = 0.5, write a simulation code to determine the value $\Delta = x_{max} - x_{min}$ above
which to discard the production with CL = 99% for a control batch of n = 100
elements.


9.2 A device is formed by five components, according to the scheme shown in the
figure:


The system operates if the “workflow” goes from A to B, that is, if at least one
of the paths (1,3,5), (2,3,4), (1,4) and (2,5) is working. Determine, using a simulation,
the average operating time of the device, knowing that the failure time of each
component is a negative exponential random variable with a mean value of
$\langle t_1 \rangle, \langle t_2 \rangle, \langle t_4 \rangle, \langle t_5 \rangle$ = 3 days and $\langle t_3 \rangle$ = 5 days, respectively.


9.3 Modify the integration routine MCinteg (used in the Exercises 9.2 and 9.3) to 
calculate the integral of the standard Gaussian: 


$$ \frac{1}{\sqrt{2\pi}} \int_0^x \exp(-t^2/2)\, dt. $$


Compare the results with those of Table E.1 for various x values. Exclude the 
importance sampling technique. Try also to use our routine MCintopt. 


9.4 Using all the MC methods discussed in the text, calculate, possibly modifying 
the MCinteg routine, the integral: 


$$ I = \int_0^1 \log(1 + x)\, dx. $$


9.5 Calculate with the MC methods the integral: 


$$ I = \int_{-1}^{1} \int_{-1}^{1} \int_{-1}^{1} (x_1^2 + x_2^2 + x_3^2)\, dx_1\, dx_2\, dx_3. $$


9.6 Compute, with the hit or miss method, the area of the ellipse: 


7 2 
og a i, 
77% 


9.7 Implement a Metropolis algorithm to estimate the mean and variance of a
standard Gaussian distribution, using a simulation of length N. Set U(−a, a) as an
auxiliary distribution. Then, extract a sample of size N from a standard Gaussian
with the rnorm routine of R. Compare the trajectories of the estimators as N
increases with a = 2. Repeat the experiments with different values of a: is the speed
of convergence of the trajectories of the mean and variance estimators influenced by
a? Is it reasonable to use a > 3?


9.8 A detector has an entrance window of radius $R_d$ = 3 cm, and a plane
radioactive source, with sides $(x_2 - x_1)$ = 4 cm and $(y_2 - y_1)$ = 6 cm, is placed at
a distance of h = 5 cm from it (see figure). The source emits particles isotropically
with angles (θ, φ) from evenly distributed points (x, y). Calculate the geometric
efficiency of the detector (i.e. the probability for an emitted particle to enter the
detector) using the configuration proposed in the figure.

[Figure: plane radioactive source, emitted particle and detector window]


9.9 The Von Mises distribution has density:

$$ p(\theta) = \frac{1}{2\pi I_0(c)}\, e^{c \cos\theta}, \qquad -\pi \leq \theta \leq \pi, $$

where $I_0(c)$ is the zero order modified Bessel function of the first type. Use the
Metropolis algorithm to generate a sample from this distribution without calculating
$I_0(c)$.


9.10 Generate various samples from the Ising model with 6 = 0.3 and T = 1, 
increasing step by step the dimension n x n of the lattice with the number N of 
iterations fixed at 100,000. On the basis of the plot of the number of atoms with 
spin = 1, what is, in your opinion, the maximum value of n for which the algorithm 
is working correctly? 


Chapter 10
Statistical Inference and Likelihood


In speaking of the most probable consequence, we must 
remember that in reality the probability of transition to states of 
higher entropy is so enormous in comparison with that of any 
appreciable decrease in entropy that in practice the latter can 
never be observed in nature. 


L.D. Landau, E.M. Lifchitz and L.P. Pitaevskij, 
“STATISTICAL PHYSICS”. 


10.1 Introduction 


In Chaps. 6 and 7, we introduced estimation theory and hypothesis testing in the
context of elementary statistics: from a data sample, an estimate of a statistical
parameter (mean, probability, correlation coefficient, etc.) with its error and the
related confidence interval is determined. Afterwards, if
necessary, the compatibility of this estimate with a model, generally called the null
hypothesis $H_0$, is verified by means of χ² or other tests.

In this scheme, which can be defined “static”, the model to be checked is given a 
priori and is not modified by the information coming from the collected sample. 

However, a more “dynamic” and efficient approach can be adopted, which 
consists in determining the most likely model of the parent population on the basis 
of the collected data. In this case, the estimation intervals (which, as we know, may 
depend on the population model assumed) are then modified accordingly, before 
performing the test between data and model. The scheme is that of Fig. 10.1, where 
the bold arrows indicate the differences with respect to the static model considered 
so far. 

The methods that determine the most likely population model from a set of data 
by improving the parameter estimation are known as best fit methods. In practice, 
the optimization of the model from the data is usually achieved assuming the family 
of the density function (binomial, Gaussian, Poissonian, uniform or other) to be 
known and considering the characteristic parameters of the distribution such as 
mean, variance or existence limits, as free parameters. 


© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
A. Rotondi et al., Probability, Statistics and Simulation, La Matematica per il 3+2
139, https://doi.org/10.1007/978-3-031-09429-3_10


[Figure: flow diagram linking data, estimate, hypothesis testing (χ² or other tests) and most likely model (hypothesis $H_0$)]

Fig. 10.1 Parameter estimation and hypothesis testing. The bold arrows denote the steps of the
model optimization


In this chapter we will use the following notation: $\boldsymbol{X}$ is a vector of random
variables, each of which generates a sample, according to Definition 2.12.
Therefore, we will refer to:

$$ \boldsymbol{x} = (x_1, x_2, \ldots, x_m) \tag{10.1} $$

as the observed values of $\boldsymbol{X} = (X_1, X_2, \ldots, X_m)$ in a trial, and we will write that:

$$ \boldsymbol{x}_i = (x_{1i}, x_{2i}, \ldots, x_{mi}) $$

are the occurrences of $\boldsymbol{X}$ in the i-th trial.
We introduce also a new notation, distinguishing between the variable X and an
n-dimensional sample:

$$ \underline{X} = (X_1, X_2, \ldots, X_n). \tag{10.2} $$

The values assumed by the sample after the experiment (i.e. after n trials) are:

$$ \underline{x} = (x_1, x_2, \ldots, x_n). \tag{10.3} $$

Hence, we consider a probability space $(S, \mathcal{F}, P_\theta)$ depending on one or more
parameters, according to Eq. (6.1). The density to be optimized is therefore of the
type $p(x; \theta)$, where $\theta = (\theta_1, \theta_2, \ldots, \theta_p)$ is a p-dimensional parameter. To optimize
the density then means to determine the values of the θ parameters which best fit
the collected data, having fixed a priori a functional form.

Let us start with a single random variable X, and let $\underline{x} = (x_1, x_2, \ldots, x_n)$ be the
observed values in n independent trials. If $p(x; \theta)$ is the density of X (depending on
a set of parameters θ), we can apply the law of compound probabilities to the case
of n independent trials carried out on the same variable, and, recalling Theorem 4.1,
we can associate the observation with the probability density:

$$ L(\theta; \underline{x}) = p(x_1; \theta)\, p(x_2; \theta) \cdots p(x_n; \theta) = \prod_{i=1}^{n} p(x_i; \theta). \tag{10.4} $$

The name given to this product, considered as a function of θ, is likelihood function.
For any fixed θ, it represents, apart from the differential factors to be integrated, the
probability to obtain the values $\underline{x}$.

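For m variables, the likelihood function is generalized in an obvious way through
Eqs. (10.1)-(10.3):

$$ L(\theta; \underline{\boldsymbol{x}}) = p(x_{11}, x_{21}, \ldots, x_{m1}; \theta)\, p(x_{12}, x_{22}, \ldots, x_{m2}; \theta) \cdots p(x_{1n}, x_{2n}, \ldots, x_{mn}; \theta) = \prod_{i=1}^{n} p(\boldsymbol{x}_i; \theta), \tag{10.5} $$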


where the product is extended to all n experimental values obtained for each of the m
variables X. The general definition of the likelihood function also includes the case
of non-independent trials; we will take this possibility into account in Sects. 12.7
and 12.8, when considering experimental data affected by systematic errors. As
we will see, the mathematical properties of the likelihood function of interest in
statistics are the same as those of its logarithm. It is then possible to eliminate the product in
Eqs. (10.4) and (10.5) by defining the new function:

$$ \mathcal{L} = -\ln(L(\theta; \underline{x})) = -\sum_{i=1}^{n} \ln(p(x_i; \theta)), \tag{10.6} $$


where the minus sign in front of the logarithm should be noticed. If this convention
is adopted, a maximum of L corresponds to a minimum of $\mathcal{L}$. In the following,
to simplify calculations, we often use, instead of the L function, its negative
logarithm $\mathcal{L}$.
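Besides simplifying the calculations, the negative logarithm (10.6) is also more convenient numerically, as the following sketch suggests. The Gaussian sample of size 2000, the seed and the function names are assumptions made for illustration: the product (10.4) of many densities underflows in double precision, while the sum of logarithms remains finite.

```r
set.seed(42)                                   # arbitrary seed
x <- rnorm(2000, mean = 0, sd = 1)             # simulated sample (assumption)

# Likelihood as the product of the densities, Eq. (10.4)
likelihood <- function(mu, sigma) prod(dnorm(x, mu, sigma))

# Negative log-likelihood, Eq. (10.6)
neg_log_lik <- function(mu, sigma) -sum(dnorm(x, mu, sigma, log = TRUE))

print(likelihood(0, 1))       # the product is too small to be represented as a double
print(neg_log_lik(0, 1))      # finite value
print(neg_log_lik(1, 1))      # larger: mu = 1 describes the data worse than mu = 0
```

Comparing the last two values shows the convention stated above at work: the parameter value that fits the data better has the smaller negative log-likelihood.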


10.2 Maximum Likelihood (ML) Method


The maximum likelihood (ML) method for estimating the θ parameters was introduced
by R.A. Fisher in 1912. It can be stated as follows:


Definition 10.1 (Maximum Likelihood Method (ML)) Given a set of observed
values $\underline{x} = (x_1, x_2, \ldots, x_n)$ from a random sample $\underline{X} = (X_1, X_2, \ldots, X_n)$ with
p.d.f. $p(x; \theta)$, where θ is a parameter varying in a set Θ, the maximum likelihood
estimate $\hat\theta$ of θ is the maximum point (if it exists) of the likelihood function (10.5).
Shortly:

$$ \max_{\Theta} [L(\theta; \underline{x})] = \max_{\Theta} \left[ \prod_{i=1}^{n} p(x_i; \theta) \right] = L(\hat\theta; \underline{x}). \tag{10.7} $$


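Alternatively, the principle requires minimizing the logarithmic function $\mathcal{L}$ of
Eq. (10.6). In this case, the minimum point of the function is easily obtained by
solving the likelihood equations with respect to θ:

$$ \frac{\partial \mathcal{L}}{\partial \theta_k} = -\sum_{i=1}^{n} \frac{1}{p(x_i; \theta)} \frac{\partial p(x_i; \theta)}{\partial \theta_k} = 0, \qquad (k = 1, 2, \ldots, p). \tag{10.8} $$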

Three points have to be emphasized to clarify the ML method: 


• Before the trial, the likelihood function $L(\theta; \underline{X})$ is proportional to the p.d.f.
of $(X_1, X_2, \ldots, X_n)$. In general, there is proportionality and not coincidence
because, in the maximization of the likelihood function, constant factors not
affecting the position of the maximum with respect to θ are sometimes omitted;

• After the trial, during the likelihood maximization (or the minimization of its
negative logarithm), the occurrences $\underline{x}$ of the variables $\underline{X}$ are used. The quantities
$\underline{x}$ are then regarded as fixed;

• After the trial the likelihood function (or its negative logarithm) only depends on
the values θ, which now are the variables with respect to which to maximize (or
minimize). The maximum (or minimum) point $\hat\theta$ is the ML parameter estimate
obtained from the data.


Now we apply the principle to some simple cases. 


Exercise 10.1
In n independent trials, x successes have been obtained. Find the ML estimate
of the probability p.


Answer We assign the binomial density (2.29) to the population:

$$ b(n, p; x) = \frac{n!}{x!\,(n-x)!}\, p^x (1-p)^{n-x}. $$

Since here we have only one observed value, the likelihood function coincides
with the binomial function, which must be maximized with respect to p,
keeping x and n fixed. Since the factorial term does not contain
p, we can neglect it during the maximization procedure. For simplicity, we
minimize the logarithmic likelihood (10.6), which now becomes:

$$ \mathcal{L} = -x \ln(p) - (n-x) \ln(1-p). $$

To find the minimum, we set the derivative to zero:

$$ \frac{d\mathcal{L}}{dp} = -\frac{x}{p} + \frac{n-x}{1-p} = 0, $$

that gives:

$$ \hat{p} = \frac{x}{n} \equiv f, \tag{10.9} $$

with the notation of Eq. (10.7). It is also easy to prove, from the second
derivative, that this is the absolute minimum of the function.
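The result (10.9) can be checked numerically by minimizing the logarithmic likelihood directly. In the sketch below the values x = 37 and n = 100 are hypothetical, chosen only for illustration; the minimum is located both with a scan over a grid of p values and with the one-dimensional minimizer optimize of R.

```r
x <- 37; n <- 100                  # hypothetical number of successes and of trials

# Negative log-likelihood of Exercise 10.1 (constant factorial term dropped)
neg_log_lik <- function(p) -x * log(p) - (n - x) * log(1 - p)

# scan on a grid of p values
grid   <- seq(0.001, 0.999, by = 0.001)
p_grid <- grid[which.min(neg_log_lik(grid))]

# one-dimensional numerical minimization
p_opt <- optimize(neg_log_lik, interval = c(0.001, 0.999))$minimum

print(c(grid = p_grid, optimize = p_opt, frequency = x / n))
```

Both numerical minima agree with the observed frequency x/n within the resolution of the method used.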


The result shows that the ML probability estimate is nothing else than the 
observed frequency. At the beginning of the book, we postulated the existence of the 
probability and noted the possible convergence of the frequency to it, as indicated 
in Eq. (1.3). Alternatively, we could have taken the ML method as a starting point
and deduced Eq. (1.3) from this. In fact, Eq. (10.9) is a special case of this principle:
all fundamental statistical estimators can be deduced starting from the ML method. 
In the following we will soon see other examples of this principle. The next two 
exercises show a likelihood function in the form of a density product. 


Exercise 10.2
Two experiments give $x_1$ successes over $n_1$ trials and $x_2$ successes over $n_2$
trials, respectively. Find the ML estimate of p.


Answer Operating as in the previous exercise, we can write, apart from
inessential constant factors, the ML function as the product of two binomials
having the same probability p:

$$ L = p^{x_1} (1-p)^{n_1 - x_1}\, p^{x_2} (1-p)^{n_2 - x_2}. $$

Using logarithms, we obtain:

$$ \mathcal{L} = -(x_1 + x_2) \ln(p) - (n_1 - x_1 + n_2 - x_2) \ln(1-p), $$

and hence:

$$ \frac{d\mathcal{L}}{dp} = -\frac{x_1 + x_2}{p} + \frac{(n_1 + n_2) - x_1 - x_2}{1-p} = 0 \implies \hat{p} = \frac{x_1 + x_2}{n_1 + n_2}. $$

The ML estimate is nothing else than the sum of the successes over the sum
of the trials.


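Exercise 10.3
Given n variates $x_i$ from a one-dimensional Gaussian, find the ML estimate
of mean and variance.


Answer Since we are dealing with n measurements, the likelihood function is
the product of n Gaussians with the same μ and σ parameters:

$$ L(\mu, \sigma) = \frac{1}{(\sqrt{2\pi}\, \sigma)^n} \exp\left[ -\frac{1}{2} \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{\sigma^2} \right]. $$

The logarithmic likelihood (10.6) is then:

$$ \mathcal{L}(\mu, \sigma) = \frac{n}{2} \ln(2\pi\sigma^2) + \frac{1}{2} \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{\sigma^2}, $$

which, setting the derivatives to zero, provides the required estimates:

$$ \frac{\partial \mathcal{L}}{\partial \mu} = -\frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0 \implies \hat\mu = \frac{1}{n} \sum_{i=1}^{n} x_i \equiv m, $$

$$ \frac{\partial \mathcal{L}}{\partial \sigma} = \frac{n}{\sigma} - \frac{1}{\sigma^3} \sum_{i=1}^{n} (x_i - \mu)^2 = 0 \implies \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - m)^2. $$

In the variance formula, the true mean μ has been replaced with the estimate
$\hat\mu = m$.

Notice that the ML estimate of the mean coincides with the usual sample
mean m, whereas the ML estimate of σ² gives the estimator (6.59), which is
unbiased only asymptotically.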
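The closed-form estimates of Exercise 10.3 can be compared with a direct numerical minimization of the logarithmic likelihood. The five data values and the starting point passed to optim are hypothetical, chosen only for illustration; optim is used with its default Nelder-Mead method.

```r
x <- c(4.9, 5.1, 5.3, 4.7, 5.0)    # hypothetical measurements
n <- length(x)

# Negative log-likelihood of Exercise 10.3; par = c(mu, sigma)
neg_log_lik <- function(par) {
  n / 2 * log(2 * pi * par[2]^2) + sum((x - par[1])^2) / (2 * par[2]^2)
}

mu_hat  <- mean(x)                       # ML estimate of the mean
var_hat <- sum((x - mu_hat)^2) / n       # ML estimate of the variance (divides by n)
var_unb <- var(x)                        # var() of R divides by n - 1 instead

fit <- optim(c(4, 1), neg_log_lik)       # numerical minimization from a rough guess

print(c(mu_hat = mu_hat, var_hat = var_hat, var_unb = var_unb))
print(c(mu_num = fit$par[1], var_num = fit$par[2]^2))
```

The numerical minimum reproduces the sample mean and the variance with denominator n, not the one with denominator n − 1 returned by var.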


When the likelihood function is differentiable with respect to θ, the ML method
reduces to a differential analysis problem. However, the terms containing θ can be
discrete, or, even if continuous, they may not admit continuous derivatives. It is
therefore necessary to resort to finite difference methods or to problem-dependent
considerations. The following exercise is an example of this type of issue.


Exercise 10.4
The sampling of a variable from the uniform density (3.79):

$$ u(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } a \leq x \leq b, \\ 0 & \text{for } x < a,\ x > b \end{cases} $$

gives the n variates:

$$ x_1 < x_2 < \cdots < x_{n-1} < x_n. $$

Find the ML estimate of a and b.
Answer Since the values have been sorted in increasing order, the condition:

$$ a \leq x_1, \qquad b \geq x_n $$

must apply. The likelihood function (10.5) becomes:

$$ L = \frac{1}{(b-a)^n}, \qquad a \leq x_1, \quad b \geq x_n. $$

In this case, even if we have continuous parameters, the maximum of the
function cannot be found by differentiation due to its discontinuities. Since
this specific likelihood is maximal when the denominator is minimal, the
ML estimates of a and b coincide with the smallest and largest observed
values, respectively:

$$ \hat{a} = x_1, \qquad \hat{b} = x_n. $$

It is important to note that the remaining part of the observed values:

$$ x_2, x_3, \ldots, x_{n-1} $$

does not provide any information about the estimate of a and b. Since the
extremes $x_1$ and $x_n$ contain all the information necessary for the estimation,
they are said to be a sufficient statistic for (a, b).
This concept will be developed in more detail in the next section.
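The behaviour of the likelihood of Exercise 10.4 can be explored with a short sketch. The true limits a = 2 and b = 5, the sample size, the seed and the function name are assumptions made for illustration.

```r
# Likelihood of the uniform density on [a, b] for the sample x
likelihood <- function(a, b, x) {
  if (a > min(x) || b < max(x)) return(0)   # at least one value has zero density
  (b - a)^(-length(x))
}

set.seed(3)                                 # arbitrary seed
x <- runif(50, min = 2, max = 5)            # simulated sample (assumption)

a_hat <- min(x)                             # ML estimate of a
b_hat <- max(x)                             # ML estimate of b

print(c(a_hat = a_hat, b_hat = b_hat))
print(likelihood(a_hat, b_hat, x))                 # maximum of the likelihood
print(likelihood(a_hat - 0.1, b_hat + 0.1, x))     # wider interval: smaller value
print(likelihood(a_hat + 0.01, b_hat, x))          # interval excluding a datum: zero
```

The likelihood jumps to zero as soon as the interval excludes one observed value, which is why the maximum lies on the boundary and cannot be found by setting derivatives to zero.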


[Figure: observed values on the x axis and candidate densities (hypotheses); the likelihood is $L = \prod_i p(x_i)$]

Fig. 10.2 Intuitive justification of the maximum likelihood method: the best density reproducing
the data is the one in bold, which maximizes the product of the ordinates


We have shown what the ML method consists in and the results it provides. 
However, we believe that it is useful to suggest an intuitive argument that explains 
why the method works, that is, why it gives reasonable parameter estimates. Look at 
Fig. 10.2, where the maximization of the likelihood function is sketched as a shift
of the $p(x; \theta)$ density along the abscissa axis (here θ is a location parameter of
the distribution). The observed values are on the abscissa axis, represented by bold
points; they will concentrate in one or more regions, with some values dispersed in 
the other areas. The likelihood function is obtained by the product of the ordinates 
of the observed data calculated through the density function. As can be seen from 
the figure, the maximum likelihood is obtained when the density is “best adjusted” 
to the data. The ML parameter estimate corresponds to this maximum. 

Now is the time to make a little effort of abstraction and study the fundamental 
results of the estimator theory. After this step, developed in the next two sections, 
the theory of estimation will appear clearer, and probably even more interesting. 


10.3 Estimator Properties 


In this section we consider the θ parameter as a scalar, but Eqs. (10.2), (10.3)
and (10.4) hold without modifications also in the multidimensional case. To begin
with, let us go back to the definition of estimator, which we briefly mentioned in
Sect. 2.11.


Definition 10.2 (Point Estimator) Given a sample $\underline{X}$ of size n from an m-dimensional
random variable $\boldsymbol{X}$ with density $p(\boldsymbol{x}; \theta)$, with θ ∈ Θ, a point estimator
(or estimator) of the parameter θ is a statistic (see Definition 2.13):

$$ T_n = t_n(\underline{X}), \tag{10.10} $$

with values in Θ.


The estimator is then defined as a function $T : S \to \Theta$ that maps from the sample
space S to the parameter space Θ, used to estimate θ. As we know, the mean M
and the variance $S^2$ of a sample are estimators of the parameters μ and σ². The
conventions we will use for estimators are as follows: $T_n$ (or T) is a random variable
from Eq. (10.10), $t_n(\cdot)$ is the associated functional form and $t_n$ (or t) is an occurrence
of $T_n$ (or T) after a trial or experiment.

A reasonable estimator should get closer and closer to the true value of the 
parameter as the number of observations increases. In this respect, a useful property 
is consistency: 


Definition 10.3 (Consistent Estimator) An estimator $T_n$ of the parameter θ is
called consistent if it converges in probability towards θ according to Eq. (2.73):

$$ \lim_{n \to \infty} P\{ |T_n - \theta| < \varepsilon \} = 1, \qquad \forall \varepsilon > 0. \tag{10.11} $$

As we can see, the consistency of the estimator requires only the convergence
in probability. However, for the cases considered in this text, the almost sure
convergence of Eq. (2.74) also holds. Using the expected value of an estimator (see
Sect. 2.11 for this important concept), from Tchebychev’s inequality (3.92), it is
easy to demonstrate that a sufficient condition for Eq. (10.11) to hold is:

$$ \lim_{n \to \infty} \langle T_n \rangle = \theta, \tag{10.12} $$

$$ \lim_{n \to \infty} \mathrm{Var}[T_n] = \lim_{n \to \infty} \left( \langle T_n^2 \rangle - \langle T_n \rangle^2 \right) = 0. \tag{10.13} $$


If $n$ remains finite, it is reasonable to expect that the true mean of the
distribution of the estimator $T_n$ also coincides with $\theta$. However, this condition is
not always satisfied: an example that we have already discussed several times is
that of the uncorrected (biased) sample variance (6.59). Therefore, the following
definition is necessary:


Definition 10.4 (Unbiased Estimator) An estimator $T_n$ of a parameter $\theta$ is
unbiased when:

$$ \langle T_n \rangle = \theta , \quad \forall\, n . \tag{10.14} $$

Otherwise, the estimator is called biased and one has:

$$ \langle T_n \rangle = \theta + b_n , \tag{10.15} $$

where $b_n$ is called systematic effect or bias. In general the bias depends on $\theta$. When
Eq. (10.14) does not hold, but Eq. (10.12) remains valid, then:

$$ \lim_{n \to \infty} b_n = 0 , $$

and the estimator is called asymptotically unbiased (or asymptotically correct).

The variance (6.59) is then a consistent, biased and asymptotically unbiased
estimator.
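The behaviour of the variance (6.59) can be checked numerically. The following is a minimal sketch in R, under the assumption of Gaussian data; the function names `biased_var` and `mean_est`, the seed, the true variance and the sample sizes are illustrative choices and do not come from the text.

```r
# Uncorrected sample variance (denominator n): biased for finite n
biased_var <- function(x) mean((x - mean(x))^2)

set.seed(1)
sigma2 <- 4                       # true variance, chosen for the illustration

# Average of the estimator over many simulated samples of size n
mean_est <- function(n, nrep = 20000) {
  mean(replicate(nrep, biased_var(rnorm(n, mean = 0, sd = sqrt(sigma2)))))
}

m5   <- mean_est(5)     # expected value (n-1)/n * sigma2 = 3.2
m100 <- mean_est(100)   # expected value 0.99 * sigma2    = 3.96
print(c(n5 = m5, n100 = m100))
```

The simulated averages should approach $(n-1)\sigma^2/n$, so the bias $b_n = -\sigma^2/n$ vanishes as $n$ grows, as required for an asymptotically unbiased estimator.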

Besides consistency and unbiasedness, the third important property of an
estimator is efficiency.


Definition 10.5 (Most Efficient Estimator) Given two unbiased estimators $T_n$ and
$Q_n$ of the same parameter $\theta$, $T_n$ is more efficient than $Q_n$ if the relation:

$$ \mathrm{Var}[T_n] < \mathrm{Var}[Q_n] , \quad \forall\, \theta \in \Theta \tag{10.16} $$

holds.


Clearly, having to choose between two estimators, all other conditions being equal,
the more efficient one is preferable, because it yields smaller confidence
intervals in the estimation of $\theta$.

Another important property of a statistic is sufficiency, introduced by R.A.
Fisher in 1925. We report the simplest formulation of this property, which makes
use of the likelihood function:


Definition 10.6 (Sufficient Statistic) The statistic $S_n$ is called sufficient if the
likelihood function (10.5) can be factorized into the product of two functions $h$
and $g$ such that:

$$ L(\theta; \underline{x}) = g(s_n(\underline{x}), \theta)\, h(\underline{x}) , \tag{10.17} $$

where $h(\underline{x})$ does not depend on $\theta$.
For a multidimensional parameter, the function $g$ is written as:

$$ g(s_n(\underline{x}), q_n(\underline{x}), \ldots, \boldsymbol{\theta}) $$

and one says that the statistics $S_n, Q_n, \ldots$ are jointly sufficient for $\boldsymbol{\theta}$.


In some texts, Eq. (10.17) is called the factorization theorem. In fact,
sufficiency is sometimes defined in very general terms, and subsequently
Eq. (10.17) is shown to be a necessary and sufficient condition for the validity of
this property. In this text we instead adopt the factorization theorem as a definition
of sufficiency.

In practice, the sufficient statistic contains all the information about the parameter
to be estimated. Indeed, when differentiating Eq. (10.17) with respect to $\theta$ to obtain the
maximum likelihood, the function $h(\underline{x})$ plays the role of a simple constant and is
therefore irrelevant in the estimation of $\theta$. It is also clear that, if $S$ is a sufficient
statistic, a statistic $Q = f(S)$, where $f$ is an invertible function of $S$, is also
sufficient, since the likelihood function can be written in the form:

$$ L(\theta; \underline{x}) = g(f^{-1}(q), \theta)\, h(\underline{x}) . \tag{10.18} $$




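A suitable example of sufficient statistic is given by the extreme values of a sample
drawn from the uniform density, as discussed in Exercise 10.4. Instead, an example
of jointly sufficient statistics is given by the following statistics, on which the mean
and variance estimators for Gaussian samples are based:

$$ T_n = \sum_i X_i^2 , \qquad Q_n = \sum_i X_i . $$

Indeed, from Exercise 10.3 one has:

$$ L(\mu, \sigma; \underline{x}) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n} \exp\left[ -\frac{1}{2\sigma^2} \sum_i (x_i - \mu)^2 \right]
 = \frac{1}{(\sqrt{2\pi}\,\sigma)^n} \exp\left[ -\frac{1}{2\sigma^2} \left( \sum_i x_i^2 - 2\mu \sum_i x_i + n\mu^2 \right) \right] , $$

from which:

$$ L(\mu, \sigma; \underline{x}) = \frac{1}{\sigma^n} \exp\left[ -\frac{1}{2\sigma^2} \left( t_n(\underline{x}) - 2\mu\, q_n(\underline{x}) + n\mu^2 \right) \right] h = g(t_n(\underline{x}), q_n(\underline{x}), \mu, \sigma)\, h , $$

with $h = (2\pi)^{-n/2}$ being a constant. This formal treatment corresponds to the well-known
practical fact that, to estimate the mean and variance of a sample, it is enough to
know the sum of the observations and the sum of their squares, that is, to
calculate the mean of the squares and the square of the mean.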
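The factorization for the Gaussian case can be verified directly. The sketch below is only illustrative: the simulated sample, the seed and the names `loglik_suff` and `loglik_full` are assumptions introduced here. It compares the log-likelihood computed from the full sample with the one computed from the two sums alone.

```r
set.seed(2)
x  <- rnorm(50, mean = 3, sd = 2)   # simulated Gaussian sample (illustrative)
n  <- length(x)
qn <- sum(x)                        # q_n(x): sum of the observations
tn <- sum(x^2)                      # t_n(x): sum of the squared observations

# Log-likelihood written only in terms of (t_n, q_n, n)
loglik_suff <- function(mu, sigma) {
  -n * log(sqrt(2 * pi) * sigma) - (tn - 2 * mu * qn + n * mu^2) / (2 * sigma^2)
}

# Log-likelihood computed from the whole sample
loglik_full <- function(mu, sigma) sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))

# ML estimates obtained from the two sums alone
mu_hat     <- qn / n
sigma2_hat <- tn / n - mu_hat^2     # mean of the squares minus square of the mean

print(c(suff = loglik_suff(2.5, 1.7), full = loglik_full(2.5, 1.7)))
```

The two log-likelihood values coincide for any $(\mu, \sigma)$, so the raw data add nothing once $t_n$ and $q_n$ are known.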


10.4 Theorems on Estimators 


In this section we have gathered all the important results of the theory of ML
estimators and the Cramér-Rao lower bound theorem, also valid for other estimators.

For simplicity, we will consider populations with probability density $p(x; \theta)$
depending on a scalar parameter $\theta$. However, the formulae remain valid even in
the case of a vector of parameters, if $\theta$ is replaced with the vector $\boldsymbol{\theta}$ and the partial
derivative $\partial/\partial\theta$ with $\partial/\partial\theta_k$ with respect to any $k$-th element of the vector $\boldsymbol{\theta}$.
Multi-parameter generalization will only be discussed in cases where this procedure is not
entirely obvious.

The first theorem links the ML estimators to the sufficient statistic, showing that
the sufficient statistic is the best way to summarize the experimental information
about $\theta$.


Theorem 10.1 (Sufficient Statistics) If $S_n = s_n(\underline{X})$ is a sufficient statistic for $\theta$,
the ML estimator $\hat\theta$, if it exists, is always a function of $S_n$.


Proof From Eq. (10.17) it follows that $L(\theta; \underline{x})$ has its maximum $\hat\theta$ at the point
where $g(s_n(\underline{x}), \theta)$ has its maximum, and this point obviously depends on $\underline{x}$ through
$s_n$ only. The theorem is easily extended to a set of jointly sufficient statistics. □




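Theorem 10.2 (Reparameterizations) If $\eta \in H$ is a parameter depending on
another parameter $\theta$ through a one-to-one function $g : \Theta \to H$

$$ \eta = g(\theta) , \tag{10.19} $$

the ML estimate of $\eta$ is given by:

$$ \hat\eta = g(\hat\theta) , \tag{10.20} $$

where $\hat\theta$ is the ML estimate of $\theta$.


Proof We first prove that $\eta$ and $\theta$ have the same likelihood function. In fact, since
$g$ is invertible, one has:

$$ L_\theta(\theta; \underline{x}) = L_\theta(g^{-1}(\eta); \underline{x}) \equiv L_\eta(\eta; \underline{x}) , \tag{10.21} $$

where we have highlighted the equality between two different functional forms, $L_\theta$
and $L_\eta$. For instance, if $\eta = \ln\theta$ and $\theta$ is the mean of a Gaussian, one has:

$$ L_\theta \propto \exp\left[ -\frac{1}{2} \frac{(x - \theta)^2}{\sigma^2} \right] $$

or

$$ L_\eta \propto \exp\left[ -\frac{1}{2} \frac{(x - e^{\eta})^2}{\sigma^2} \right] . $$

We also note that this transformation does not have the complications seen with
random variables, where the Jacobian determinants are necessary to change the
differentials, because what is transformed here are the parameters, not the variables.
Since, by definition:

$$ L_\theta(\hat\theta; \underline{x}) \ge L_\theta(\theta; \underline{x}) \quad \text{for any } \theta , $$

and $\Theta$ is in one-to-one correspondence with $H$ through $g$, by applying Eq. (10.21)
to both members of the inequality, one has:

$$ L_\eta(g(\hat\theta); \underline{x}) \ge L_\eta(\eta; \underline{x}) \quad \text{for any } \eta , $$

and hence:

$$ \hat\eta = g(\hat\theta) . $$

□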


This theorem is very useful: for example, all the quantities that are one-to-one functions
of the sample mean can be considered as the result of ML estimates.
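The reparameterization theorem can be illustrated numerically. The sketch below assumes an exponential sample, with $\theta$ the rate $\lambda$ and $\eta = g(\theta) = 1/\lambda$ the mean lifetime $\tau$; the data, the seed and the search intervals are illustrative assumptions. Each likelihood is maximized separately with `optimize`.

```r
set.seed(3)
x <- rexp(200, rate = 0.5)          # simulated exponential sample (illustrative)

# Negative log-likelihood in the two parameterizations
nll_rate <- function(lambda) -sum(dexp(x, rate = lambda,  log = TRUE))
nll_tau  <- function(tau)    -sum(dexp(x, rate = 1 / tau, log = TRUE))

# Independent one-dimensional maximizations
lambda_hat <- optimize(nll_rate, interval = c(0.01, 10))$minimum
tau_hat    <- optimize(nll_tau,  interval = c(0.1, 100))$minimum

print(c(lambda_hat = lambda_hat, tau_hat = tau_hat, g_of_lambda_hat = 1 / lambda_hat))
```

Within the numerical tolerance of the optimizer, `tau_hat` agrees with `1 / lambda_hat`, and both agree with the closed-form result based on the sample mean, $\hat\tau = \bar{x}$.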




We will now assume that the so-called regularity properties are satisfied:

• $\theta$ belongs to $\Theta$.

• The function $p(x; \theta)$ is a p.d.f. for any $\theta \in \Theta$.

• The set $\{x : p(x; \theta) > 0\}$, which is the p.d.f. support, does not depend on $\theta$.

• If $\theta_1 \ne \theta_2$, then there exists at least a set $B \in \mathcal{F}$ for which
  $\int_B p(x; \theta_1)\, dx \ne \int_B p(x; \theta_2)\, dx$.

• $p(x; \theta)$ is differentiable for any $\theta \in \Theta$.

• The operations of sum and integration in $x$ and of differentiation by $\theta$ can be
  exchanged.


In order to avoid confusion when studying the likelihood function, it is always
necessary to check whether the analysis is performed by considering $L(\theta; \underline{X})$ as a
random function of $\underline{X}$ with $\theta$ fixed or as a function of the parameter $\theta$ for a particular
observation $\underline{x}$. Therefore, we recommend paying attention to the upper or lower case
notation in the following discussions. For example, averages of the type:

$$ \left\langle \left( \frac{\partial}{\partial\theta} \ln L \right)^2 \right\rangle $$

refer to functions of random variables such as $\ln L(\theta; \underline{X})$ at fixed $\theta$ values, whereas the
ML method applies to the sample of observed values $\underline{x}$ and considers the likelihood
as a function of $\theta$.

The most important theorem on the estimator variance was formulated by the 
statisticians Cramér and Rao in 1944-45, who established a lower bound for it. This 
theorem can be easily understood if some fundamental relations are kept in mind. 

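The first one exploits the fact that, if the regularity conditions are valid, the
density function is normalized independently of $\theta$:

$$ \int p(x; \theta)\, dx = 1 , $$

and hence:

$$ \int \frac{\partial p(x; \theta)}{\partial\theta}\, dx = \frac{\partial}{\partial\theta} \int p(x; \theta)\, dx = 0 . \tag{10.22} $$

From this relation one can also show that:

$$ \left\langle \frac{\partial}{\partial\theta} \ln p(X; \theta) \right\rangle = 0 . \tag{10.23} $$

Indeed:

$$ \int \frac{\partial p(x; \theta)}{\partial\theta}\, dx = \int \frac{1}{p(x; \theta)} \frac{\partial p(x; \theta)}{\partial\theta}\, p(x; \theta)\, dx
 = \int \frac{\partial \ln p(x; \theta)}{\partial\theta}\, p(x; \theta)\, dx = \left\langle \frac{\partial}{\partial\theta} \ln p(X; \theta) \right\rangle = 0 . $$

Obviously, also the following relation is valid:

$$ \frac{\partial}{\partial\theta} \int \frac{\partial p(x; \theta)}{\partial\theta}\, dx = \int \frac{\partial^2 p(x; \theta)}{\partial\theta^2}\, dx
 = \int \frac{1}{p(x; \theta)} \frac{\partial^2 p(x; \theta)}{\partial\theta^2}\, p(x; \theta)\, dx = \left\langle \frac{1}{p} \frac{\partial^2 p}{\partial\theta^2} \right\rangle = 0 . \tag{10.24} $$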


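Equation (10.23) shows that the derivative with respect to $\theta$ of the logarithm of a
density, the so-called score function, is always a function of $X$ with zero mean. The
variance of the score function is known as Fisher information and, for $\theta$ present in
the p.d.f. of $X$, is usually denoted as $I(\theta)$. Recalling Eq. (2.67), it is given by:

$$ \mathrm{Var}\left[ \frac{\partial}{\partial\theta} \ln p(X; \theta) \right]
 = \left\langle \left( \frac{\partial}{\partial\theta} \ln p(X; \theta) - \left\langle \frac{\partial}{\partial\theta} \ln p(X; \theta) \right\rangle \right)^2 \right\rangle
 = \left\langle \left( \frac{\partial}{\partial\theta} \ln p(X; \theta) \right)^2 \right\rangle \equiv I(\theta) . \tag{10.25} $$

Notice that the remarkable relation:

$$ I(\theta) = \left\langle \left( \frac{\partial}{\partial\theta} \ln p(X; \theta) \right)^2 \right\rangle = - \left\langle \frac{\partial^2}{\partial\theta^2} \ln p(X; \theta) \right\rangle \tag{10.26} $$

is valid, since:

$$ \left\langle \frac{\partial^2}{\partial\theta^2} \ln p \right\rangle = \left\langle \frac{\partial}{\partial\theta} \left( \frac{\partial \ln p}{\partial\theta} \right) \right\rangle
 = \left\langle -\frac{1}{p^2} \left( \frac{\partial p}{\partial\theta} \right)^2 + \frac{1}{p} \frac{\partial^2 p}{\partial\theta^2} \right\rangle
 = - \left\langle \left( \frac{\partial \ln p}{\partial\theta} \right)^2 \right\rangle , $$

where Eq. (10.24) has been used.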

It should be kept in mind that the fundamental relations (10.22)-(10.26) are
valid for any density satisfying the regularity conditions summarized above. We
recommend doing some exercises to check them for some of the known densities,
such as the binomial and the Gaussian. It is important to notice that these relations also
hold for the likelihood function, which, as we know, is the p.d.f. of a sample of
size $n$ of one or more random variables (apart from a constant factor). For example, the
Fisher information about $\theta$ contained in the likelihood is related to that contained in
$p(x_i; \theta)$ by the crucial relation, corresponding to Eq. (10.25):

$$ \left\langle \left( \frac{\partial}{\partial\theta} \ln L \right)^2 \right\rangle
 = \left\langle \left( \frac{\partial}{\partial\theta} \sum_i \ln p(X_i; \theta) \right)^2 \right\rangle
 = \sum_i \left\langle \left( \frac{\partial}{\partial\theta} \ln p(X_i; \theta) \right)^2 \right\rangle = n I(\theta) , \tag{10.27} $$

where Eqs. (5.81), (10.5), and (10.25) have been used together with the fact that the
sample consists of $n$ independent occurrences of $X$.
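Relations (10.23), (10.25) and (10.26) lend themselves to a quick Monte Carlo check. The following sketch assumes a Poisson density with mean $\mu$, for which the score is $X/\mu - 1$ and the second derivative of $\ln p$ is $-X/\mu^2$; the value of $\mu$, the seed and the number of simulated events are arbitrary illustrative choices.

```r
set.seed(4)
mu <- 3
x  <- rpois(200000, lambda = mu)   # simulated Poisson occurrences (illustrative)

score  <- x / mu - 1               # d/dmu   ln p(x; mu)
second <- -x / mu^2                # d2/dmu2 ln p(x; mu)

mean_score <- mean(score)          # Eq. (10.23): should be close to 0
info_sq    <- mean(score^2)        # Eq. (10.25): should be close to 1/mu
info_d2    <- -mean(second)        # Eq. (10.26): should be close to 1/mu

print(c(mean_score = mean_score, info_sq = info_sq, info_d2 = info_d2, exact = 1 / mu))
```

For a sample of $n$ independent occurrences the same quantities are simply multiplied by $n$, in agreement with Eq. (10.27).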
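A last useful equation applies to any (unbiased or biased) estimator $T_n$ of $\theta$:

$$ \left\langle T_n \frac{\partial}{\partial\theta} \ln L(\theta; \underline{X}) \right\rangle = \frac{\partial \tau(\theta)}{\partial\theta} = 1 + \frac{\partial b_n(\theta)}{\partial\theta} , \tag{10.28} $$

where $\tau(\theta) = \langle T_n \rangle$ and $b_n(\theta)$ are assumed differentiable in $\theta$. Notice that $b_n(\theta) = 0$
for an unbiased estimator and that, in this case, the last member of Eq. (10.28)
is equal to 1. The previous formula can be easily derived from the equivalence
between the density function of the sample and the likelihood function and from
the regularity conditions, since:

$$ \begin{aligned}
\frac{\partial \tau(\theta)}{\partial\theta} &= \frac{\partial}{\partial\theta} \langle T_n \rangle
 = \int t_n \frac{\partial L(\theta; \underline{x})}{\partial\theta}\, d\underline{x} \\
 &= \int t_n \frac{1}{L(\theta; \underline{x})} \frac{\partial L(\theta; \underline{x})}{\partial\theta}\, L(\theta; \underline{x})\, d\underline{x}
 = \int t_n \frac{\partial \ln L(\theta; \underline{x})}{\partial\theta}\, L(\theta; \underline{x})\, d\underline{x} \\
 &= \left\langle T_n \frac{\partial}{\partial\theta} \ln L(\theta; \underline{X}) \right\rangle .
\end{aligned} $$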
We can now prove the 


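Theorem 10.3 (Cramér-Rao Bound) Let $p(x; \theta)$ be the p.d.f. of $X$ and let $T_n$ be
an estimator of $\theta$ with finite variance based on the sample $\underline{X}$. If $\tau(\theta) = \langle T_n \rangle$ is
differentiable and the information $I(\theta)$ (10.25) of the density $p$ remains finite for
any $\theta$, the estimator variance can never be less than $\tau'(\theta)^2 / [n I(\theta)]$:

$$ \mathrm{Var}[T_n] \ge \frac{\tau'(\theta)^2}{n \left\langle \left( \frac{\partial}{\partial\theta} \ln p(X; \theta) \right)^2 \right\rangle} = \frac{\tau'(\theta)^2}{n I(\theta)} . \tag{10.29} $$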


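Proof From Eqs. (10.23) and (10.28), since the score has zero mean, it follows that:

$$ \left\langle T_n \frac{\partial}{\partial\theta} \ln L \right\rangle = \left\langle (T_n - \tau(\theta)) \frac{\partial}{\partial\theta} \ln L \right\rangle = \tau'(\theta) . \tag{10.30} $$

By squaring this expression and applying the Cauchy-Schwarz inequality (4.29),
one obtains:

$$ \left\langle (T_n - \tau(\theta))^2 \right\rangle \left\langle \left( \frac{\partial}{\partial\theta} \ln L \right)^2 \right\rangle
 \ge \left\langle (T_n - \tau(\theta)) \frac{\partial}{\partial\theta} \ln L \right\rangle^2 = \tau'(\theta)^2 . \tag{10.31} $$

From Eq. (10.27) one then obtains:

$$ \left\langle (T_n - \tau(\theta))^2 \right\rangle = \mathrm{Var}[T_n] \ge \frac{\tau'(\theta)^2}{n I(\theta)} . $$

□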

If $T_n$ is an unbiased estimator of $\theta$, the Cramér-Rao lower bound becomes
$1/[n I(\theta)]$. The Cramér-Rao theorem allows one to evaluate precisely the
estimator efficiency through the definition of its lower bound. Indeed, the ideal
unbiased estimator, based on Definition 10.5, is the one with minimum variance, that
is, the estimator which satisfies the condition:

$$ \mathrm{Var}[T_n] = \frac{1}{n \left\langle \left( \frac{\partial}{\partial\theta} \ln p(X; \theta) \right)^2 \right\rangle} = \frac{1}{n I(\theta)} . \tag{10.32} $$

This estimator, among the unbiased ones, is considered as the most efficient or as
the best estimator. It is then natural to define the efficiency of a generic unbiased
estimator as:

$$ \varepsilon(T_n) = \left[ \mathrm{Var}[T_n]\, n I(\theta) \right]^{-1} . \tag{10.33} $$

For the best unbiased estimator, the condition $\varepsilon(T_n) = 1$ holds. An estimator that is
not the most efficient is also said to be inadmissible.

The variance of the most efficient estimator, and hence the width of the optimal confidence
interval of $\theta$, is small when the information is large. This explains the word
information given to $I$.

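For a $p$-dimensional parameter $\boldsymbol{\theta}$, estimated by $\boldsymbol{T}_n(\underline{X})$ in an unbiased way, the
generalization of Eq. (10.32) provides the variance and covariance matrix of $\boldsymbol{T}_n$. Its
$(i, j)$ elements are given by:

$$ \left[ (nI)^{-1} \right]_{ij} = \mathrm{Cov}[T_i, T_j] = \left\langle (T_i - \theta_i)(T_j - \theta_j) \right\rangle , \qquad
 (nI)_{ij} = - \left\langle \frac{\partial^2 \ln L}{\partial\theta_i\, \partial\theta_j} \right\rangle , \tag{10.34} $$

where $T_i$, $i = 1, \ldots, p$, is the $i$-th component of $\boldsymbol{T}_n$.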




Exercise 10.5

Find the information on the probability contained in the binomial distribution
and that on the mean for the Poisson and Gauss distributions. Comment on
the results obtained.


Answer Denoting by $b(x; p)$, $p(x; \mu)$ and $g(x; \mu, \sigma)$ the binomial,
Poissonian and Gaussian p.d.f.s, respectively, one easily derives from
Eqs. (2.29), (3.14), and (3.28):

$$ \begin{aligned}
\ln b(X; p) &= \ln n! - \ln (n - X)! - \ln X! + X \ln p + (n - X) \ln(1 - p) , \\
\ln p(X; \mu) &= X \ln \mu - \ln X! - \mu , \\
\ln g(X; \mu, \sigma) &= \ln\left( \frac{1}{\sqrt{2\pi}\,\sigma} \right) - \frac{1}{2} \left( \frac{X - \mu}{\sigma} \right)^2 .
\end{aligned} $$

Notice that these functions are random variables, because they are functions
of $X$ (in capital letter). By differentiating with respect to the parameters of
interest one obtains:

$$ \begin{aligned}
\frac{\partial}{\partial p} \ln b(X; p) &= \frac{X}{p} - \frac{n - X}{1 - p} = \frac{X - np}{p(1 - p)} , \\
\frac{\partial}{\partial\mu} \ln p(X; \mu) &= \frac{X}{\mu} - 1 = \frac{X - \mu}{\mu} , \\
\frac{\partial}{\partial\mu} \ln g(X; \mu, \sigma) &= \left( \frac{X - \mu}{\sigma} \right) \frac{1}{\sigma} = \frac{X - \mu}{\sigma^2} .
\end{aligned} \tag{10.35} $$

All these derivatives have null mean, according to Eq. (10.23), since the
difference $(X - \mu)$ appears in their numerators (with $\mu = np$ for the binomial).

The information can now be calculated by applying Eqs. (10.26) and (10.35),
by using the square of the first or the second derivative, whichever is more
convenient:

$$ \begin{aligned}
I(p) &= \frac{1}{p^2 (1 - p)^2} \left\langle (X - np)^2 \right\rangle = \frac{n p (1 - p)}{p^2 (1 - p)^2} = \frac{n}{p(1 - p)} , \\
I(\mu) &= - \left\langle \frac{\partial^2}{\partial\mu^2} \ln p(X; \mu) \right\rangle = \frac{\langle X \rangle}{\mu^2} = \frac{1}{\mu} , \\
I(\mu) &= \frac{1}{\sigma^4} \left\langle (X - \mu)^2 \right\rangle = \frac{1}{\sigma^2} ,
\end{aligned} \tag{10.36} $$

where the property $\sigma^2 = \langle (X - \mu)^2 \rangle$ has been applied to the explicit form of
the variances of these densities (see for example Table 3.1).

What considerations do these findings now suggest?

First, the information is proportional to the inverse of the variance: a
"narrow" density, with little dispersion around the mean, will have high
information, as is intuitive.

Furthermore, if we introduce the frequency estimator:

$$ T = F = \frac{X}{n} , $$

which estimates the probability based on the number of successes $X$, and the
sample mean estimator:

$$ T_n = M = \sum_i \frac{X_i}{n} , $$

which evaluates the mean $\mu$ from $n$ variates of $X$, we see that, for these
estimators, the Cramér-Rao bound coincides with the statistical uncertainty
deduced in Chap. 6:

$$ \mathrm{Var}[F] = \frac{1}{I(p)} = \frac{p(1 - p)}{n} , \qquad
 \mathrm{Var}[M] = \frac{1}{n I(\mu)} = \frac{\sigma^2}{n} , $$

with $\sigma^2 = \mu$ in the Poissonian case. For large samples, this error is evaluated under
the approximations $p \simeq f$ and $\sigma \simeq s$, which provide the well-known estimation
intervals (6.33) and (6.50).

We deduce that the frequency is the best estimator of the probability that
appears as a parameter in the binomial density and that the sample mean is
the best estimator of the mean of the Poissonian and Gaussian distributions.
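The conclusion of the exercise can be checked by simulation. The sketch below compares the simulated variances of the sample mean (Poisson case) and of the frequency (binomial case) with the corresponding Cramér-Rao bounds; the parameter values, the seed and the number of repetitions are illustrative assumptions.

```r
set.seed(5)
n    <- 20        # sample size (Poisson) and number of trials (binomial)
nrep <- 50000     # number of simulated experiments
mu   <- 4
p    <- 0.3

# Sample mean of n Poisson variates, repeated nrep times
m <- replicate(nrep, mean(rpois(n, lambda = mu)))
# Frequency of successes in n binomial trials, repeated nrep times
f <- rbinom(nrep, size = n, prob = p) / n

var_m   <- var(m)
bound_m <- mu / n               # 1 / (n I(mu)), with I(mu) = 1 / mu
var_f   <- var(f)
bound_f <- p * (1 - p) / n      # 1 / I(p), with I(p) = n / (p (1 - p))

print(rbind(mean = c(simulated = var_m, bound = bound_m),
            freq = c(simulated = var_f, bound = bound_f)))
```

Both ratios between simulated variance and bound should be compatible with 1, that is, with an efficiency $\varepsilon = 1$ as defined in Eq. (10.33).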


Given all these premises, we can now introduce the pivotal ML theorem, the one 
that assigns to the method the fundamental role in parameter estimation. 


Theorem 10.4 (About the Most Efficient Estimator) If $T_n$ is an unbiased estimator
of $\tau(\theta)$ with minimum variance (i.e. the best estimator), it coincides with the ML
estimator, if it exists:

$$ T_n = \tau(\hat\theta) . $$


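Proof Since the Cramér-Rao lower bound is attained by the best estimator, we can
write, from Eq. (10.29):

$$ \left\langle [T_n - \tau(\theta)]^2 \right\rangle \frac{n I(\theta)}{[\tau'(\theta)]^2} = 1 . \tag{10.37} $$

This relation is satisfied if and only if:

$$ \frac{\partial \ln L}{\partial\theta} = \frac{n I(\theta)}{\tau'(\theta)} [T_n - \tau(\theta)] . \tag{10.38} $$

In fact, Eq. (10.37) is easily obtained if we square Eq. (10.38), take the expectation
value and use Eq. (10.27).

Conversely, if Eq. (10.37) holds, then the Cauchy-Schwarz inequality (10.31)
becomes an equality, so $\partial \ln L / \partial\theta = c\,[T_n - \tau(\theta)]$ for a given constant $c$. Taking
into account also Eq. (10.30) one can write:

$$ \frac{n I(\theta)}{\tau'(\theta)} = \frac{1}{\tau'(\theta)} \left\langle \left( \frac{\partial \ln L}{\partial\theta} \right)^2 \right\rangle
 = \frac{c}{\tau'(\theta)} \left\langle [T_n - \tau(\theta)] \frac{\partial \ln L}{\partial\theta} \right\rangle = c , $$

from which, since $c = n I(\theta) / \tau'(\theta)$, Eq. (10.38) follows.
If now $\underline{x}$ is fixed and $\theta$ is variable, the ML estimate is obtained setting Eq. (10.38)
to zero:

$$ \left. \frac{\partial \ln L}{\partial\theta} \right|_{\hat\theta} = \frac{n I(\hat\theta)}{\tau'(\hat\theta)} [t_n - \tau(\hat\theta)] = 0 , \quad \text{from which} \quad t_n = \tau(\hat\theta) . $$

□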


We now investigate the asymptotic properties of ML estimators, starting from
consistency. A rigorous proof that it holds can be found in [Cra51, Azz96,
Jam08]; here we will present a simple heuristic argument. Equations (10.23)
and (10.26) show that, by the law of large numbers:

$$ \frac{1}{n} \frac{\partial \ln L}{\partial\theta} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial \ln p(x_i; \theta)}{\partial\theta} $$

converges in probability (and also almost surely) to:

$$ \left\langle \frac{\partial \ln p(X; \theta)}{\partial\theta} \right\rangle = 0 $$

and that:

$$ \frac{1}{n} \frac{\partial^2 \ln L}{\partial\theta^2} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2 \ln p(x_i; \theta)}{\partial\theta^2} $$

converges in probability (and also almost surely) to:

$$ \left\langle \frac{\partial^2}{\partial\theta^2} \ln p(X; \theta) \right\rangle = -I(\theta) < 0 , $$

if $\theta$ is the true parameter value. This shows that, as $n$ increases, the first derivative
of the log-likelihood calculated at $\theta$ tends in probability to zero and that
the second derivative is negative in probability. We are therefore led to think that
the distance between $\theta$ and the maximum point $\hat\theta$ of the log-likelihood tends in
probability to zero, which corresponds precisely to the consistency of the ML
estimator.

Let us now consider a series of experiments, in each of which we obtain a value of
$\underline{x}$ and maximize the likelihood function with respect to $\theta$. The set of the $\hat\theta_i$ estimates
thus obtained forms a sample of the random variable $\hat\Theta$. If we perform an infinite
series of experiments, we will get the true distribution of $\hat\Theta$. Will it be Gaussian?
The asymptotic normality theorem ensures that, if the sample size $n$ is large enough,
the answer is affirmative. We give only a hint of the proof, without treating precisely
the terms that are negligible (according to the convergence in probability)
in the Taylor series expansions.

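Considering, for simplicity, a single-valued random variable $X$, we expand the
derivative of the logarithm of $L$ around the true value $\theta$:

$$ \left. \frac{\partial \ln L}{\partial\theta} \right|_{\hat\theta}
 = \left. \frac{\partial \ln L}{\partial\theta} \right|_{\theta}
 + \left. \frac{\partial^2 \ln L}{\partial\theta^2} \right|_{\theta} (\hat\theta - \theta)
 + \frac{1}{2} \left. \frac{\partial^3 \ln L}{\partial\theta^3} \right|_{\theta} (\hat\theta - \theta)^2 + \ldots \tag{10.39} $$

If $n$ is the sample size, we can write:

$$ \frac{1}{n} \left. \frac{\partial \ln L}{\partial\theta} \right|_{\hat\theta}
 = \frac{1}{n} \sum_{i=1}^{n} \left. \frac{\partial \ln p(x_i; \theta)}{\partial\theta} \right|_{\theta}
 + \frac{1}{n} \sum_{i=1}^{n} \left. \frac{\partial^2 \ln p(x_i; \theta)}{\partial\theta^2} \right|_{\theta} (\hat\theta - \theta)
 + \frac{1}{2n} \sum_{i=1}^{n} \left. \frac{\partial^3 \ln p(x_i; \theta)}{\partial\theta^3} \right|_{\theta} (\hat\theta - \theta)^2 + \ldots \tag{10.40} $$

Since the ML estimator is consistent, for $n$ large
enough $(\hat\theta - \theta)$ will become small, and, if the average values of the derivatives
remain bounded (regularity condition), the terms higher than the first order can
be neglected in Eq. (10.40). Moreover, since for $\theta = \hat\theta$ the first derivative of $\ln L$
vanishes, Eq. (10.40) becomes:

$$ \frac{1}{n} \sum_{i=1}^{n} \left. \frac{\partial \ln p(x_i; \theta)}{\partial\theta} \right|_{\theta}
 \simeq - \frac{1}{n} \sum_{i=1}^{n} \left. \frac{\partial^2 \ln p(x_i; \theta)}{\partial\theta^2} \right|_{\theta} (\hat\theta - \theta) . $$

The previous formula can be rewritten as follows:

$$ \frac{ \dfrac{1}{\sqrt{n I(\theta)}} \displaystyle\sum_{i=1}^{n} \frac{\partial \ln p(x_i; \theta)}{\partial\theta} }
 { - \dfrac{1}{n I(\theta)} \displaystyle\sum_{i=1}^{n} \frac{\partial^2 \ln p(x_i; \theta)}{\partial\theta^2} }
 \simeq \sqrt{n I(\theta)}\, (\hat\theta - \theta) . $$

From the Central Limit Theorem and Eqs. (10.23) and (10.27), the numerator of the
first member converges in distribution to a standard Gaussian, while, by the law of
large numbers and Eq. (10.26), the denominator converges almost surely to 1. It
can then be shown that this implies the convergence in distribution to a Gaussian
for the whole ratio at the first member. Therefore, this also applies to the second
member, which we consider identical except for negligible terms. We then conclude
that, for large $n$, we have approximately:

$$ (\hat\theta - \theta) \sim N\left( 0, \frac{1}{n I(\theta)} \right) . \tag{10.41} $$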


We can finally state the important 


Theorem 10.5 (Asymptotic Normality) If the regularity conditions of Sect. 10.4
hold, the ML estimators are asymptotically normal with an expected value equal to
the true value of the parameter and have asymptotic efficiency equal to 1.


Proof Equation (10.41) shows that the estimator $\hat\theta$, for large $n$, is normally
distributed with mean $\theta$ and variance given by the Cramér-Rao bound (10.29). □


In practice, the values of $n$ at which the distribution of $\hat\theta$ can be approximated with a
Gaussian depend both on the sample parent population $p(x; \theta)$ and on the estimator
type. For the sample mean, as we already know, normality is reached quite fast
($n > 10$). For other estimators, such as the sample variance, the convergence to
normality is much slower.
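The approach to normality can be explored by simulation. As a hedged illustration, the sketch below uses the ML estimator of the rate of an exponential density, $\hat\lambda = 1/\bar{x}$, for which $I(\lambda) = 1/\lambda^2$; the helper names `z_sample` and `skew`, the seed and the sample sizes are assumptions made here. It builds the standardized variable $\sqrt{n I(\lambda)}\,(\hat\lambda - \lambda)$ of Eq. (10.41) for a small and a large sample size.

```r
set.seed(6)
lambda <- 2

# Standardized ML estimates sqrt(n I) (lambda_hat - lambda), with I = 1 / lambda^2
z_sample <- function(n, nrep = 20000) {
  lam_hat <- replicate(nrep, 1 / mean(rexp(n, rate = lambda)))
  sqrt(n) * (lam_hat - lambda) / lambda
}

# Sample skewness, used here as a simple indicator of non-normality
skew <- function(z) mean((z - mean(z))^3) / sd(z)^3

z_small <- z_sample(5)
z_large <- z_sample(400)

print(rbind(n5   = c(mean = mean(z_small), sd = sd(z_small), skew = skew(z_small)),
            n400 = c(mean = mean(z_large), sd = sd(z_large), skew = skew(z_large))))
```

For the large sample the mean, standard deviation and skewness should be close to the standard Gaussian values, while for the small sample the distribution remains visibly asymmetric.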

All previous arguments show that the ML method provides consistent, asymptotically
unbiased (with a bias of order $O(1/n)$, as in Exercise 10.3) and
asymptotically normal estimators with variance given by the Cramér-Rao bound.
Estimators of this type are called BAN (Best Asymptotically Normal).

These properties make maximum likelihood the most widely used method in statistics
for the point estimation of parameters when the density $p(x; \theta)$ is known a priori.


10.5 Confidence Intervals 


The point estimation of the parameters through the maximization of the likelihood
also provides the elements to carry out the interval estimations. Indeed:


(a) We know, from Theorem 10.5, that $(\hat\theta - \theta)$ is asymptotically normal with null
    mean and variance $1/[n I(\theta)]$.

(b) During the proof of Theorem 10.5, we have seen that $(\partial \ln L / \partial\theta)$ has zero mean,
    variance $n I(\theta)$ and nearly normal distribution for large $n$.


These two methods give practically the same results for the interval estimation. If the
estimated information $I(\hat\theta)$ is used instead of the expected one (which corresponds,
in elementary statistics, to the plug-in rule $s \simeq \sigma$), usually the confidence intervals
already presented in Chap. 6 are found. The distortion of the interval introduced by
this approximation is studied in detail in [Jam08] and is of the order of $1/n$. You
can go deeper into these aspects by solving Problem 10.8.

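Let us now look in detail at a third method for the determination of confidence
intervals, which is of fundamental importance in multidimensional cases. We
anticipate the result, which is:


(c) The variable $2[\ln L(\hat{\boldsymbol{\theta}}) - \ln L(\boldsymbol{\theta})]$ is asymptotically distributed as $\chi^2(p)$, where
    $p$ is the dimension of $\boldsymbol{\theta}$.


For simplicity, we consider an asymptotic approximation of $2[\ln L(\hat\theta) - \ln L(\theta)]$ in
the one-dimensional case and expand, up to the second order, the negative logarithm
$\mathcal{L}(\theta) = -\ln L(\theta)$ around $\hat\theta$, where $\theta$ is the true value of the parameter. Neglecting, as before,
the higher-order terms according to the convergence in probability, we have:

$$ \begin{aligned}
\mathcal{L}(\theta) &\simeq \mathcal{L}(\hat\theta) + \mathcal{L}'(\hat\theta)(\theta - \hat\theta) + \frac{1}{2} \mathcal{L}''(\hat\theta)(\theta - \hat\theta)^2 \\
 &\simeq \mathcal{L}(\hat\theta) + \frac{1}{2} \mathcal{L}''(\theta)(\theta - \hat\theta)^2 \simeq \mathcal{L}(\hat\theta) + \frac{n I(\theta)}{2} (\theta - \hat\theta)^2 .
\end{aligned} $$

The linear term vanishes because $\mathcal{L}'(\hat\theta) = 0$. The error term in the first row, which is
$o((\theta - \hat\theta)^2)$, can be neglected, thanks to the
consistency of $\hat\theta$. This justifies also the exchange of $\hat\theta$ with $\theta$ in the argument of $\mathcal{L}''$;
finally, for the last step, the law of large numbers was used. Therefore, we can write:

$$ 2[\ln L(\hat\theta) - \ln L(\theta)] \simeq n I(\theta)\, (\hat\theta - \theta)^2 . \tag{10.42} $$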
We now perform a reparameterization using an invertible function $\eta(\theta)$ such that we
have $n I_\eta(\eta) = 1$; the function $\eta(\cdot)$ which fulfills this requirement is any primitive
of $\sqrt{n I(\theta)}$. Equation (10.42), reformulated as a function of $\eta$, becomes:

$$ \ln L_\eta(\hat\eta) - \ln L_\eta(\eta) = \frac{1}{2} (\hat\eta - \eta)^2 . \tag{10.43} $$

From Theorem 10.5 and taking into account the reparameterization, we know that
approximately $\hat\eta \sim N(\eta, 1)$. Therefore, the confidence intervals for $\eta$ of half-width
equal to one or two standard deviations are respectively $\hat\eta \pm 1$ and $\hat\eta \pm 2$, with
an (approximate) confidence level of 68.3% and 95.4%. Thanks to Eq. (10.43), the
extremes of the corresponding confidence intervals are calculated, with respect to $\eta$,
with the equations:

$$ \ln L_\eta(\hat\eta) - \ln L_\eta(\eta) = 0.5 , $$
$$ \ln L_\eta(\hat\eta) - \ln L_\eta(\eta) = 2 . $$

The same result can also be obtained by noting that $(\hat\eta - \eta)^2 \sim \chi^2(1)$, since
$(\hat\eta - \eta)$ is distributed as a standard Gaussian. Then, if $1 - \alpha = CL$ is the confidence level,
the extremes of the corresponding interval will be given by the $\eta$ values satisfying
the equation:

$$ 2 \left( \ln L_\eta(\hat\eta) - \ln L_\eta(\eta) \right) = \chi^2_{1-\alpha}(1) , \tag{10.44} $$

where $\chi^2_{1-\alpha}(1)$ is the quantile of order $1 - \alpha$ of the $\chi^2$ distribution with one degree of
freedom. If $1 - \alpha = 0.683$, then $\chi^2_{1-\alpha}(1) = 1.00$, which corresponds to solving the equation
$[\ln L_\eta(\hat\eta) - \ln L_\eta(\eta)] = 0.5$ for $\eta$.

We have then found a reparameterization that produces symmetric intervals
around $\hat\eta$ with Gaussian confidence levels. But now the important point comes: it is
not necessary to explicitly perform the reparameterization, it is enough to know that
it exists. In fact, thanks to Theorem 10.2, we know that numerically the likelihoods
$L_\theta$ and $L_\eta$ are equal; then one can just use the original likelihood $L_\theta$ and find the
values $\theta_1$ and $\theta_2$ for which:

$$ 2\Delta[\ln L] \equiv 2 \left( \ln L(\hat\theta) - \ln L(\theta) \right) = \chi^2_{1-\alpha}(1) , \tag{10.45} $$

to obtain the interval estimate for $\theta$:

$$ \theta_1 \le \theta \le \theta_2 , \quad CL = 1 - \alpha . \tag{10.46} $$

For instance, to determine the confidence intervals with CL = 68.3% and CL =
95.4%, the extremes which give $2\Delta[\ln L] = 1$ or $2\Delta[\ln L] = 4$ must be found. An
application of this method is shown in Problem 10.8.
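The search for the extremes $\theta_1$ and $\theta_2$ can be carried out with a one-dimensional root finder. The following is a minimal sketch, assuming an exponential sample whose parameter is the mean lifetime $\tau$ (so that $\hat\tau = \bar{x}$); the simulated data, the seed and the bracketing intervals passed to `uniroot` are illustrative choices.

```r
set.seed(7)
x <- rexp(30, rate = 1 / 5)                 # simulated lifetimes, true tau = 5 (assumed)

loglik  <- function(tau) sum(dexp(x, rate = 1 / tau, log = TRUE))
tau_hat <- mean(x)                          # ML estimate of tau

cl <- 0.683
# Zero of this function <=> 2 * Delta[ln L] equals the chi-square(1) quantile
delta <- function(tau) 2 * (loglik(tau_hat) - loglik(tau)) - qchisq(cl, df = 1)

# One root on each side of the maximum
tau_lo <- uniroot(delta, interval = c(tau_hat / 10, tau_hat))$root
tau_hi <- uniroot(delta, interval = c(tau_hat, tau_hat * 10))$root

print(c(lower = tau_lo, estimate = tau_hat, upper = tau_hi))
```

The resulting interval is asymmetric around $\hat\tau$, with the upper error larger than the lower one. Changing `cl` to 0.954 gives the interval corresponding to $2\Delta[\ln L] \simeq 4$.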

If the likelihood function is sufficiently regular, it is therefore possible to
determine both the confidence intervals and their corresponding confidence levels.
The confidence levels are the Gaussian ones, but the intervals are not Gaussian-like,
since in general they are asymmetric with respect to the point estimate $\hat\theta$. The intervals that are
more similar to the Gaussian ones are those of the transformed parameter $\eta$ of
Eq. (10.19), whose existence is warranted by Theorem 10.2 and which have been
used to determine the variation of the likelihood as a function of the confidence
levels. All these remarks are schematized in Fig. 10.3. Moreover, the consistency
of $\hat\theta$ ensures that, as $n$ increases, the estimation interval (10.46) also tends to
be symmetrical around $\hat\theta$.

Fig. 10.3 Determination of the confidence intervals for the ML estimation of a one-dimensional
parameter. The probability levels indicated in the figure reflect the fact that $-\ln L \sim \chi^2/2$ and
that the $\chi^2(1)$ quantile takes the values 1 and 4 for cumulative probabilities of 68.3% and 95.4%,
respectively

For a $p$-dimensional parameter $\boldsymbol{\theta}$, the boundary of the
confidence set is found using the condition equivalent to Eq. (10.45):

$$ 2\Delta[\ln L] \equiv 2 \left( \ln L(\hat{\boldsymbol{\theta}}) - \ln L(\boldsymbol{\theta}) \right) = \chi^2_{1-\alpha}(p) , \tag{10.47} $$

where $\chi^2_{1-\alpha}(p)$ is the quantile of the assigned CL and the asymptotic distribution of
$2\Delta[\ln L]$ is $\chi^2(p)$. The values $\chi^2_{1-\alpha}(p)$, as a function of the degrees of freedom, can be
read in Table E.4. For example, from this table we see that, with two degrees of
freedom, the regions enclosed by the contours $2\Delta[\ln L] \simeq 2.3$ and $2\Delta[\ln L] \simeq 6.0$
correspond to CL $\simeq$ 68% and CL $\simeq$ 95%, respectively. Similarly, we find intervals
by solving $2\Delta[\ln L] \simeq 1$ and $2\Delta[\ln L] \simeq 4$ in the one-dimensional case.

Equation (10.47) requires exploring the $\chi^2$ hypersurfaces, and its application is
therefore often difficult. In practice, almost always, the $p$ one-dimensional confidence
intervals, each of level $CL = 1 - \alpha$, are determined numerically by varying one parameter $\theta_i$
at a time (according to a grid of values) and maximizing the likelihood with respect
to the other parameters for each value of $\theta_i$. The interval $(\theta_{i1}, \theta_{i2})$ is obtained by
solving, with respect to $\theta_i$, an equation similar to Eq. (10.45):

$$ 2 [\ln L(\hat{\boldsymbol{\theta}}) - \ln L(\theta_i, \hat{\boldsymbol{\theta}}(\theta_i))] = \chi^2_{1-\alpha}(1) . \tag{10.48} $$

Here $L(\theta_i, \hat{\boldsymbol{\theta}}(\theta_i))$ is the likelihood maximized with respect to all the other components
of $\boldsymbol{\theta}$ with $\theta_i$ fixed, and $L(\hat{\boldsymbol{\theta}})$ is the best fit value obtained by maximizing over all
free parameters. This procedure is justified considering the following identity:

$$ \ln L(\hat{\boldsymbol{\theta}}) - \ln L(\boldsymbol{\theta}) = [\ln L(\hat{\boldsymbol{\theta}}) - \ln L(\theta_i, \hat{\boldsymbol{\theta}}(\theta_i))] + [\ln L(\theta_i, \hat{\boldsymbol{\theta}}(\theta_i)) - \ln L(\boldsymbol{\theta})] . $$


Fig. 10.4 Forms assumed by the confidence regions given by Eqs. (10.47) (darker
region) and (10.45) (lighter band). The dark region is the random region which
contains the true value of the pair $(\theta_1, \theta_2)$ with probability 39.3%; the light band, which
is the projection of the dark region on the abscissa axis, is the occurrence of the random
interval containing with a 68.3% probability the true value of $\theta_2$ (or of $\theta_1$, if the
other axis is considered). Axis labels: $\theta_1$, $\theta_2$; interval marked $\hat\theta_2 \pm s(\hat\theta_2)$


Multiplied by 2, the first member has the asymptotic distribution $\chi^2(p)$, while the second addendum
of the second member has the asymptotic distribution $\chi^2(p - 1)$, being in fact
$2\Delta[\ln L]$ when $\theta_i$ is known. This suggests, by analogy with the $\chi^2$ additivity Theorem 3.4
discussed in Appendix C, that the first term on the second member, multiplied by 2, has
an asymptotic distribution $\chi^2(1)$. The errors found with Eqs. (10.47) and (10.48)
have the meaning shown in Fig. 4.8 and in Fig. 10.4 for the two-parameter case. The
outline of the darker region corresponds to Eq. (10.47) for a variation of one $\chi^2$ unit:
according to Table E.4, with two degrees of freedom, this contour has a confidence
level of almost 40% (the exact value is 39.3%, as seen from Eq. (4.83)). Instead
Eq. (10.48) corresponds to the Gaussian confidence level in one dimension for $\theta_2$,
shown by the light-coloured region of the figure. The errors usually provided by
the minimizing programmes, if the boundaries of the $\chi^2$ regions are not explicitly
required, are calculated with Eq. (10.48) and refer to a CL = 0.68 for each single
parameter, regardless of the others.
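The one-parameter-at-a-time procedure of Eq. (10.48) can be sketched for a Gaussian sample with both $\mu$ and $\sigma$ unknown. In this illustration (data, seed and function names are assumptions made here) the maximization over $\sigma$ at fixed $\mu$ is available in closed form, $\hat\sigma(\mu)^2 = \sum_i (x_i - \mu)^2 / n$, so no inner numerical optimization is needed; the interval for $\mu$ is found for a variation of one $\chi^2$ unit.

```r
set.seed(8)
x <- rnorm(40, mean = 10, sd = 3)      # simulated Gaussian sample (illustrative)
n <- length(x)

loglik <- function(mu, sigma) sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))

# Value of sigma that maximizes the likelihood at fixed mu (closed form)
sigma_prof <- function(mu) sqrt(mean((x - mu)^2))

mu_hat   <- mean(x)
lnL_best <- loglik(mu_hat, sigma_prof(mu_hat))

# Zero of this function <=> 2 [ln L(best) - ln L(mu, sigma_hat(mu))] = 1
g <- function(mu) 2 * (lnL_best - loglik(mu, sigma_prof(mu))) - 1

half  <- 5 * sd(x) / sqrt(n)           # generous bracketing half-width
mu_lo <- uniroot(g, interval = c(mu_hat - half, mu_hat))$root
mu_hi <- uniroot(g, interval = c(mu_hat, mu_hat + half))$root

print(c(lower = mu_lo, estimate = mu_hat, upper = mu_hi,
        plug_in_error = sigma_prof(mu_hat) / sqrt(n)))
```

In this particular case the interval is symmetric and its half-width, $\hat\sigma \sqrt{e^{1/n} - 1}$, is very close to the familiar $\hat\sigma/\sqrt{n}$.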


10.6 Least Squares Method and Maximum Likelihood 


Let us consider the observation of $n$ independent Gaussian variables, coming from
$n$ different Gaussian distributions. In this case, the likelihood function is:

$$ L(\boldsymbol{\theta}; \underline{x}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_i(\boldsymbol{\theta})} \exp\left[ -\frac{1}{2} \frac{(x_i - \mu_i(\boldsymbol{\theta}))^2}{\sigma_i^2(\boldsymbol{\theta})} \right] , \tag{10.49} $$

where, in general, the parameters $\mu_i$ and $\sigma_i$ depend, in turn, on a multidimensional
parameter $\boldsymbol{\theta}$. Note that the previous formula generalizes Eq. (10.4), where the $n$
measurements came from the same population. As a matter of fact, the $n$ populations
are now different, although they all have a density of the same Gaussian form. Since, also in
this case, the likelihood represents the probability density of the sample, we can
apply the same approach developed up to this point. The ML method then allows one to evaluate
the parameters $\sigma_i$ and $\mu_i$ through the maximization of Eq. (10.49) or the minimization
of its negative logarithm:


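$$ \mathcal{L} \equiv -\ln L(\boldsymbol{\theta}; \underline{x}) = -\sum_{i=1}^{n} \ln\left( \frac{1}{\sqrt{2\pi}\,\sigma_i(\boldsymbol{\theta})} \right) + \frac{1}{2} \sum_{i=1}^{n} \frac{(x_i - \mu_i(\boldsymbol{\theta}))^2}{\sigma_i^2(\boldsymbol{\theta})} . \tag{10.50} $$

Since:

$$ -\sum_{i=1}^{n} \ln\left( \frac{1}{\sqrt{2\pi}\,\sigma_i(\boldsymbol{\theta})} \right) = -n \ln\left( \frac{1}{\sqrt{2\pi}} \right) + \frac{1}{2} \sum_{i=1}^{n} \ln(\sigma_i^2(\boldsymbol{\theta})) , $$

and both the first constant term and the 1/2 multiplicative factor are inessential,
minimizing Eq. (10.50) is equivalent to searching for the minimum of the function:

$$ \mathcal{L}(\boldsymbol{\theta}) = \sum_{i=1}^{n} \ln(\sigma_i^2(\boldsymbol{\theta})) + \sum_{i=1}^{n} \frac{(x_i - \mu_i(\boldsymbol{\theta}))^2}{\sigma_i^2(\boldsymbol{\theta})} , \tag{10.51} $$

which differs from $-2 \ln L(\boldsymbol{\theta}; \underline{x})$ only by an additive constant.

If we assume the standard deviations $\sigma_i$ to be known (or approximated with the $s_i$
estimated in previous experiments), the first term of Eq. (10.50) is constant, and
the ML method is reduced to the search for the minimum of the $\chi^2$ function:

$$ \chi^2(\boldsymbol{\theta}) = \sum_{i=1}^{n} \frac{(x_i - \mu_i(\boldsymbol{\theta}))^2}{\sigma_i^2} , \tag{10.52} $$

by setting to zero the derivatives:

$$ \frac{\partial \chi^2}{\partial\theta_k} = 0 , \quad k = 1, 2, \ldots, p . \tag{10.53} $$

Equation (10.53) represents the least squares (LS) method, which is discussed in
detail in Chap. 11. Here, this procedure turns out to be a consequence of the ML
approach when the data come from populations having a Gaussian density of known
variance and unknown mean to be determined.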

The important feature of the LS method is that it requires, for its application, only
the knowledge of the expected value and of the variance of the observed variables.
Moreover, after the minimization, it is always possible to calculate the final $\chi^2$ value
using the parameter best fit values at the minimum, $\hat\mu_i = \mu_i(\hat{\boldsymbol{\theta}})$:

$$ \chi^2 = \sum_{i=1}^{n} \frac{(x_i - \hat\mu_i)^2}{\sigma_i^2} , \tag{10.54} $$

and it is then also possible to perform the $\chi^2$ test, according to the procedure
discussed in Sects. 7.5 and 7.6.

Although the parameters $\hat\theta_k$ are estimates and not the true values, under certain
assumptions, which will be presented in Chap. 11, when the sample size $n$ is large,
the variable (10.54) actually tends to the $\chi^2$ density with $(n - p)$ degrees of freedom,
where $p$ is, as usual, the dimension of $\boldsymbol{\theta}$ [SW89]. Using this value, it is possible to
verify whether the functional forms assumed to calculate the true means $\mu_i$ are
compatible with the data. It is important to note that, while the mathematical part of
minimization or maximization can always be performed, hypothesis testing with $\chi^2$
is only meaningful when the involved variables are Gaussian.

One of the most important applications of the LS method is the study of 
histograms, as shown in the next section. 


10.7 Best Fit of Densities to Data and Histograms 


Here we resume and complete the topics we have already presented in Sects. 6.14 
and 7.5, regarding the estimate of the parent population of the data. 

Given a sample of raw data $x_i$, i = 1, 2, ..., n, the most direct way to fit a
density $p(x; \theta)$ to the data sample, when an appropriate code is available, is to
minimize the function:

$$\mathcal{L}(\theta) = -2 \sum_{i=1}^{n} \ln\left(p(x_i; \theta)\right) . \tag{10.55}$$


The R routines optim and mle minimize a user-supplied function fn with 
sophisticated algorithms. Our FitLike routine manages the calls to optim and 
gives the output results. 

For example, the instructions to fit a set of 1000 simulated Gaussian data with
$\mu = 70$ and $\sigma = 10$, contained in a gauss vector, to a Gaussian distribution are
the following:


>gauss<-rnorm(1000,mean=70,sd=10)
># Gaussian density; 0.399 approximates 1/sqrt(2*pi)
>f<-function(par,x) {(0.399/par[2])*exp(-0.5*((x-par[1])/par[2])^2)}
>FitLike(x=gauss,parf=c(65,11),fun=f)


This code, given the initial conditions contained in parf, returns the values
$\hat\mu = 69.3 \pm 0.2$ and $\hat\sigma = 10.1 \pm 0.2$. Errors are evaluated with numerical algorithms based
on Eq. (10.48). This method is very efficient in estimating the parameters, but does 
not allow the user, after the minimization, to easily perform goodness of fit tests. It 
should therefore be used only when the functional form of the density is certainly 
defined and when the original raw experimental data are available. 
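
As a rough illustration of Eq. (10.55), the same kind of fit can be sketched with
base R alone. This is not the FitLike routine: it assumes that calling optim
directly on $-2\sum_i \ln p(x_i;\theta)$ and inverting the numerical Hessian is an
acceptable stand-in for it. The names m2lnL, fit, est and err, as well as the seed,
are chosen here for illustration only.

```r
set.seed(1)                         # arbitrary seed, for reproducibility only
gauss <- rnorm(1000, mean = 70, sd = 10)

# -2 ln L of Eq. (10.55) for a Gaussian density; par = c(mu, sigma)
m2lnL <- function(par, x) {
  -2 * sum(dnorm(x, mean = par[1], sd = par[2], log = TRUE))
}

# Nelder-Mead minimization (optim default) from the initial values c(65, 11)
fit <- optim(par = c(65, 11), fn = m2lnL, x = gauss, hessian = TRUE)

est <- fit$par
# The Hessian of -2 ln L is twice the observed information matrix,
# so the covariance matrix of the estimates is 2 * H^{-1}
err <- sqrt(diag(2 * solve(fit$hessian)))

print(rbind(estimate = est, error = err))
```

With this parametrization the errors should come out close to $\hat\sigma/\sqrt{n}$ for
the mean and $\hat\sigma/\sqrt{2n}$ for the standard deviation, which is a quick sanity
check on the Hessian-based estimate.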

A different procedure is possible with histogrammed data. In this case, the
random variable defined in Eq. (10.51) is the number of events $I_i$ in the i-th
histogram bin of width $\Delta_i$ (i.e. the occurrence $I_i = n_i$), whereas $\mu_i$ is the expected
(theoretical) number of events. From Eq. (6.99), $\mu_i$ is given by:

$$\mu_i(\theta) = N \int_{\Delta_i} p(x; \theta)\, dx \simeq N\, p(x_{0i}; \theta)\, \Delta_i = N p_i(\theta) , \tag{10.56}$$


where the assumed density $p(x_{0i}; \theta)$ is calculated in the bin midpoint $x_{0i}$ and N is
the sample size. Here the variable x of Eq. (10.56) represents the support (spectrum)
of X, that is, the histogram abscissa.

The likelihood function is proportional to the multinomial probability (4.89)
of having $n_i$ events in the i-th bin over a total of k bins. Neglecting the factors
independent of $\theta$, one has:

$$L(\theta; n) = \prod_{i=1}^{k} \left[p_i(\theta)\right]^{n_i} , \tag{10.57}$$

$$\mathcal{L} = -\ln L(\theta; n) = -\sum_{i=1}^{k} n_i \ln\left[p_i(\theta)\right] . \tag{10.58}$$


To find the ML estimate of the p-dimensional parameter $\theta$, we can maximize
Eq. (10.57) or minimize Eq. (10.58). Usually minimization codes are used, and
logarithmic likelihoods are then minimized.

It is interesting to verify that the minimization of Eq. (10.58) implies the least
squares method, which is the most commonly used algorithm in these cases. In fact,
we can differentiate Eq. (10.58) with respect to the j-th component of $\theta$, $\theta_j$. After a
sign change, one gets:

$$\sum_i \frac{n_i}{p_i(\theta)} \frac{\partial p_i(\theta)}{\partial \theta_j} = \sum_i \frac{n_i - N p_i(\theta)}{p_i(\theta)} \frac{\partial p_i(\theta)}{\partial \theta_j} .$$

The equality holds because, taking into account the property $\sum_i p_i(\theta) = 1$ and the
regularity condition (10.22), the sum of partial derivatives vanishes.

It is easy to realize that the second member of the equality coincides, apart from
a multiplicative constant, with the partial derivative of:

$$\chi^2 = \sum_{i=1}^{k} \frac{(n_i - N p_i(\theta))^2}{N p_i(\theta)} , \tag{10.59}$$

when the denominator is regarded as a constant.
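
The two steps of this argument can be written out explicitly. The lines below only
expand the reasoning of the text, using nothing beyond $\sum_i p_i(\theta) = 1$ and a
denominator $N p_i$ held fixed during differentiation:

$$\sum_i \frac{\partial p_i}{\partial \theta_j} = \frac{\partial}{\partial \theta_j} \sum_i p_i = 0
\quad\Longrightarrow\quad
\sum_i \frac{n_i}{p_i} \frac{\partial p_i}{\partial \theta_j}
= \sum_i \frac{n_i}{p_i} \frac{\partial p_i}{\partial \theta_j} - N \sum_i \frac{\partial p_i}{\partial \theta_j}
= \sum_i \frac{n_i - N p_i}{p_i} \frac{\partial p_i}{\partial \theta_j} ,$$

$$\left. \frac{\partial}{\partial \theta_j} \sum_i \frac{(n_i - N p_i)^2}{N p_i} \right|_{N p_i \text{ fixed in the denominator}}
= -2 \sum_i \frac{n_i - N p_i}{p_i} \frac{\partial p_i}{\partial \theta_j} .$$

The multiplicative constant mentioned in the text is therefore $-2$, and the two
expressions vanish for the same value of $\theta$.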

The best fit parameters $\hat\theta$ can therefore be found by minimizing Eq. (10.59) with
respect to the numerator. It is easily recognized that this procedure is nothing more
than an application of the least squares method represented by Eq. (10.54). Since,
in this approximation, the $\chi^2$ denominator is kept constant, the modified $\chi^2$ is
sometimes used in the minimization process:

$$\chi^2 = \sum_{i=1}^{k} \frac{(n_i - N p_i(\theta))^2}{n_i} , \tag{10.60}$$

where the statistical errors have been estimated with the approximation
$\sigma_i^2 = N p_i \simeq n_i$. Therefore, to estimate $\theta$ we can use both Eqs. (10.58) and (10.60).
As discussed in Sect. 7.4, the two methods give equivalent results, if the sample is large.

However, Eq. (10.58) is the most general, since it is also appropriate for
small samples. Eq. (10.60) is instead, in principle, valid only when all the bins have
at least a few dozen events. In practice, one is often less restrictive, and a sample
that has more than five events per channel is considered large enough. If this is not
the case, adjacent channels with few events can be grouped into a single one before
the $\chi^2$ minimization. As an advantage, unlike Eq. (10.55), once this procedure has
been completed, one can proceed to the $\chi^2$ test through Eq. (10.59). The number
of degrees of freedom is equal to $(\nu - p)$, where $\nu$ is k or (k - 1), depending on
whether the sample size N is variable or constant, respectively.

This procedure fully implements the diagram of Fig. 10.1: a population model of
density $p(x; \theta)$ is devised and from the data the most likely density, given by the
best fit parameter $\hat\theta$ with its error, is evaluated; then, the $\chi^2$ test is performed to
verify the hypothesis on the functional form chosen for the density. Now let us see
this procedure in detail by examining some cases already discussed in the sections
about basic statistics.


Exercise 10.6 

Perform the best fit of the histogram (6.97) obtained with a computer
simulation of 1000 variates from a Gaussian of true parameters $\mu = 70$,
$\sigma = 10$:


   x_i    n_i       x_i    n_i
  37.5      1      72.5    207
  42.5      4      77.5    153
  47.5     16      82.5    101
  52.5     44      87.5     42
  57.5     81      92.5      7
  62.5    152      97.5      6
  67.5    186


Answer Assuming that the histogram comes from a Gaussian of unknown
parameters (as would happen for a real, non-simulated data set), we perform
the best fit of the data to a Gaussian distribution. Equation (10.56) then
becomes:

$$N p_i(\theta) \equiv N p_i(\mu, \sigma) = N\, p(x_i; \mu, \sigma)\, \Delta_i = 1000 \cdot \frac{5}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2} \frac{(x_i - \mu)^2}{\sigma^2}\right] , \tag{10.61}$$

since N = 1000 and $\Delta_i = 5$.

To apply the two different best fit procedures described before, we use 
our Gaussfit routine, which calls both the R minimization function 
mle and our routine Nlinfit, that you can find on our web site. The 
description of the code and of the statistical methods used in the estimation 
and determination of errors on the parameters can be found in the routine 
comment lines. 

This code, using Eq. (10.60), performs the minimization of a user-defined
function, given as a $\chi^2$ dependent on p parameters. If requested, the likelihood
function $-2 \ln L$ of Eq. (10.58) is used. At the end of the minimization, the
final $\chi^2$ value is calculated with Eq. (10.59), and the user can proceed to the
$\chi^2$ test.

Since to use the $\chi^2$ function (both in the minimization and in the test
phase) an event content of at least 5 per bin is required, in the application
of Eqs. (10.60) and (10.59), the $\chi^2$ is calculated by grouping the first two
channels. For example, Eq. (10.59) becomes ($n_1 = 1$, $n_2 = 4$):

$$\chi^2 = \frac{(n_1 + n_2 - N p_1(\mu,\sigma) - N p_2(\mu,\sigma))^2}{N p_1(\mu,\sigma) + N p_2(\mu,\sigma)} + \frac{(n_3 - N p_3(\mu,\sigma))^2}{N p_3(\mu,\sigma)} + \ldots$$

The results are reported in Table (10.62), which shows the formula used for
the minimization, the best fit parameter value with error, the final $\chi^2$ value
obtained from Eq. (10.59) and the observed significance SL (p-value) of the
test, obtained from Eq. (7.33) for the one-tailed test, provided by Table E.3
with $(\nu - k)$ degrees of freedom. Since N is constant and the first two bins
have been combined together, $\nu = (13 - 1 - 1) = 11$. Moreover, considering
that two parameters have been minimized, k = 2 and hence $(\nu - k) = 9$.


Equation    $\hat\mu$         $\hat\sigma$      $\chi^2$    SL
(10.58)     70.09 ± 0.31    9.75 ± 0.22    9.18     ≈ 42%        (10.62)
(10.60)     69.95 ± 0.34    9.62 ± 0.25   10.74     ≈ 29%



This table shows that the two methods used give equivalent results, even if the 
experimental sample was not particularly large. The estimate with the exact 
formula (10.58) is associated with a higher p-value, because a more accurate 
estimate of the parameters results in a better fit. 
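
The two procedures of this exercise can be sketched in base R as follows. This is
not the Gaussfit routine: it assumes the midpoint approximation of Eq. (10.56), uses
optim with its default Nelder-Mead algorithm, and all function and variable names
are hypothetical. Small differences with respect to Table (10.62) are therefore to
be expected.

```r
x <- seq(37.5, 97.5, by = 5)                       # bin midpoints
n <- c(1, 4, 16, 44, 81, 152, 186, 207, 153, 101, 42, 7, 6)
N <- sum(n); delta <- 5

# Bin probabilities p_i(mu, sigma), midpoint approximation of Eq. (10.56)
pbin <- function(par) dnorm(x, mean = par[1], sd = par[2]) * delta

# Merge the first two bins, as required for the chi-square formulae
group <- function(v) c(v[1] + v[2], v[-(1:2)])

nll   <- function(par) -sum(n * log(pbin(par)))                    # Eq. (10.58)
chi2m <- function(par)                                             # Eq. (10.60)
  sum((group(n) - N * group(pbin(par)))^2 / group(n))
chi2  <- function(par) {                                           # Eq. (10.59)
  e <- N * group(pbin(par))
  sum((group(n) - e)^2 / e)
}

fit_ml <- optim(c(65, 11), nll)     # likelihood fit
fit_ch <- optim(c(65, 11), chi2m)   # modified chi-square fit

# 12 grouped bins, N fixed, two fitted parameters
dof   <- length(group(n)) - 1 - 2
sl_ml <- pchisq(chi2(fit_ml$par), df = dof, lower.tail = FALSE)
sl_ch <- pchisq(chi2(fit_ch$par), df = dof, lower.tail = FALSE)

print(rbind(likelihood = c(fit_ml$par, chi2(fit_ml$par), sl_ml),
            mod_chi2   = c(fit_ch$par, chi2(fit_ch$par), sl_ch)))
```

In both rows the final $\chi^2$ is evaluated with Eq. (10.59), so that the two fits are
compared with the same test statistic.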


Now let us go back to an old acquaintance of ours, the ten-coin experiment, that 
we discussed for the last time in Exercise 7.7. 


Exercise 10.7 
Find the probability p of obtaining heads in a single coin toss through a best
fit to the binomial density using the data of Table 2.2 (reported again here for
convenience):


x_i    0    1    2    3    4    5    6    7    8    9   10
n_i    0    0    5   13   12   25   24   14    6    1    0


Answer In this case, the parent population has binomial density (2.29), and
Eq. (10.56) can be written as:

$$N p_i(\theta) \equiv N p_i(p) = N\, b(x_i; 10, p) = N\, \frac{10!}{x_i!\,(10 - x_i)!}\, p^{x_i} (1 - p)^{10 - x_i} , \tag{10.63}$$

where N = 100 is the number of trials and the $\Delta_i$ interval is missing because
now the variable is discrete. The probability p is the unknown parameter to
be determined through the best fit procedure.

Similarly to the previous exercise, in the $\chi^2$ calculation, the first and the
last three bins have to be grouped ($n_1 = 0, n_2 = 0, n_3 = 5$) and ($n_9 = 6,
n_{10} = 1, n_{11} = 0$) to have a number of events $\ge 5$. The numerators of the
$\chi^2$ function then become:

$$[0 + 0 + 5 - N b(0; 10, p) - N b(1; 10, p) - N b(2; 10, p)]^2 , \quad [13 - N b(3; 10, p)]^2 , \quad \ldots$$


Therefore, the histogram bins are 7 and the degrees of freedom are $\nu = 7 - 1 = 6$,
since the total number N of trials is fixed. A further degree of freedom
is lost because the probability is determined by data. The actual number of
degrees of freedom is then $(\nu - k) = 5$.

The results obtained with our code Coinfit are reported in Table (10.64):
as in the previous exercise, we have minimized both the negative logarithm
of the likelihood (10.58) and the modified $\chi^2$ (10.60), and have performed
the $\chi^2$ test with Eq. (10.59).


Equation    $\hat{p}$            $\chi^2$    SL
(10.58)     0.521 ± 0.016    3.79     ≈ 58%        (10.64)
(10.60)     0.528 ± 0.014    4.17     ≈ 53%


Again, the two methods provide similar results. However, in the case of 
small samples as the present one, we recommend, as a general rule, to use 
Eq. (10.58). 

The last time we considered the ten-coin experiment in elementary statistics,
Exercise 7.7, we estimated the probability directly from the data, based
on the experimental result of 521 heads out of 1000 tosses. From (6.33) we
then got:

$$p \in 0.521 \pm \sqrt{\frac{0.521\,(1 - 0.521)}{1000}} = 0.521 \pm 0.016 .$$


In this case, the optimization of the parameter with Eq. (10.58) has led to the 
same result obtained from the observed relative frequency of heads since both 
point estimates coincide with the ML estimate. Moreover, the sample size is 
large, so that no difference appears between the two different approximation 
methods. 
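
A base R sketch of this exercise is given below. It is not the Coinfit routine: the
one-dimensional minimizations are done here with optimize, which is an assumption
of this sketch, and the names group, nll, chi2m and chi2 are hypothetical.

```r
x <- 0:10                                      # number of heads in ten tosses
n <- c(0, 0, 5, 13, 12, 25, 24, 14, 6, 1, 0)   # observed occurrences
N <- sum(n)                                    # 100 trials

# Merge the first three and the last three bins (7 bins remain)
group <- function(v) c(sum(v[1:3]), v[4:8], sum(v[9:11]))

nll   <- function(p) -sum(n * dbinom(x, 10, p, log = TRUE))        # Eq. (10.58)
chi2m <- function(p)                                               # Eq. (10.60)
  sum((group(n) - N * group(dbinom(x, 10, p)))^2 / group(n))
chi2  <- function(p) {                                             # Eq. (10.59)
  e <- N * group(dbinom(x, 10, p))
  sum((group(n) - e)^2 / e)
}

p_ml <- optimize(nll,   interval = c(0.01, 0.99))$minimum
p_ch <- optimize(chi2m, interval = c(0.01, 0.99))$minimum

# 7 grouped bins, N fixed, one fitted parameter
dof <- length(group(n)) - 1 - 1
sl  <- pchisq(c(chi2(p_ml), chi2(p_ch)), df = dof, lower.tail = FALSE)

print(c(p_ml = p_ml, p_chi2 = p_ch))
print(sl)
```

The likelihood minimum can be cross-checked against the relative frequency of heads,
sum(n * x) / (10 * N), since both are ML estimates of p.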


The experimental data and the best fit curves of the last two exercises are reported 
in Fig. 10.5. 


10.8 Weighted Mean 


An important application of the least squares method consists in finding the mean
of Gaussian variables all having the same true mean $\mu$ but different variances.

This is a common case in laboratory activities, where it is often necessary to 
combine the results of measurements of the same quantity (same true mean) carried 
out with different devices (different measurement errors). 


Fig. 10.5 Experimental data with error bars and best fit curves (a) for a binomial density
(Exercise 10.7) and (b) Gaussian (Exercise 10.6). To facilitate the comparison, discrete points
(empty squares) of the binomial of Fig. (a) have been joined by segments


As in the previous paragraph, we obtain the result starting from the maximum
likelihood estimate under the hypothesis of Gaussian observations and then verifying
that, more generally, this is derivable as a consequence of the least squares
method.

To exemplify the problem, we first propose an interesting question. Suppose
you have a sample of n independent observations with the same mean $\mu$ and
standard deviation $\sigma$; now group them into two samples of k and n - k observations,
respectively. Since the relation $\mu = (\mu_1 + \mu_2)/2$ holds for true averages, intuitively
we are led to think that the average of the n observations should be equivalent to
half the sum of the two partial averages, $m = (m_1 + m_2)/2$.

However, this conclusion is wrong. Indeed, as can be easily verified:

$$m = \frac{1}{n} \sum_{i=1}^{n} x_i \ne \frac{1}{2} \left[ \frac{1}{k} \sum_{i=1}^{k} x_i + \frac{1}{n-k} \sum_{i=k+1}^{n} x_i \right] .$$


The explanation of this apparent paradox is subtle and conceptually important: it
is wrong to average the two partial means, because, if the two subsamples have a
different number of events, they have different variances $\sigma^2/k$ and $\sigma^2/(n-k)$.
We can then think of "weighting" the two averages in order to assign a greater
importance to those giving a more precise estimate. If we define the inverse of the
variance as a weight, then $p_1 = (\sigma^2/k)^{-1}$ and $p_2 = [\sigma^2/(n-k)]^{-1}$, and one has:

$$m = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{p_1 + p_2} \left[ \frac{p_1}{k} \sum_{i=1}^{k} x_i + \frac{p_2}{n-k} \sum_{i=k+1}^{n} x_i \right] .$$


This is the right solution, as we will now show in a general way, deriving the 
weighted average formula. 
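
The identity above is easy to check numerically. The following lines are only a
sketch with simulated data: the sample size, the split point k and the seed are
arbitrary choices made here, and weighted.mean is the base R routine.

```r
set.seed(2)                         # arbitrary seed
x <- rnorm(30, mean = 5, sd = 2)    # n = 30 observations, common mean and sigma
n <- length(x)
k <- 10                             # unequal split: k and n - k observations

m1 <- mean(x[1:k])
m2 <- mean(x[(k + 1):n])

# Naive combination: half the sum of the two partial averages
naive <- (m1 + m2) / 2

# Weights proportional to the inverse variances sigma^2/k and sigma^2/(n-k):
# the common factor sigma^2 cancels, leaving k and n - k
weighted <- weighted.mean(c(m1, m2), w = c(k, n - k))

print(c(total = mean(x), naive = naive, weighted = weighted))
```

Only the weighted combination reproduces the mean of the whole sample; the naive one
coincides with it only when k = n - k.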

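We specialize Eq. (10.50) to the case of n independent Gaussian observations
divided into k subgroups of size $n_i$ each ($\sum_i n_i = n$), all having mean $\mu$ but
different variance $\sigma_i^2$ for each subgroup. The negative logarithm of the likelihood
function is:

$$-\ln L(\mu; x) \equiv \mathcal{L} = -\sum_{i=1}^{k} n_i \ln\left(\frac{1}{\sqrt{2\pi}\,\sigma_i}\right) + \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{n_i} \frac{(x_{ij} - \mu)^2}{\sigma_i^2} , \tag{10.65}$$

where $x_{ij}$ is the j-th observation in the i-th subgroup. If we assume $\sigma_i$ as known, the
first term of this expression does not depend on any parameter and can be neglected
in the minimum search of $\mathcal{L}$, which becomes dependent on $\mu$ only. If $m_i$ is the mean
of the i-th subgroup, the minimum condition is then given by:

$$\frac{d\mathcal{L}}{d\mu} = \frac{1}{2} \frac{d\chi^2}{d\mu} = -\sum_{i=1}^{k} \frac{1}{\sigma_i^2} \sum_{j=1}^{n_i} (x_{ij} - \mu) = \mu \sum_{i=1}^{k} \frac{n_i}{\sigma_i^2} - \sum_{i=1}^{k} \frac{n_i m_i}{\sigma_i^2} = 0 , \tag{10.66}$$

that is:

$$\hat\mu = \frac{\sum_i (n_i m_i)/\sigma_i^2}{\sum_i n_i/\sigma_i^2} , \tag{10.67}$$

which is the well-known weighted average formula. This formula gives the data
"center of mass", by weighting each term by:

$$p_i = \frac{n_i}{\sigma_i^2} . \tag{10.68}$$

If all the data come from the same population, then $\sigma_i = \sigma$, and Eq. (10.67) becomes
the usual sample mean formula:

$$\hat\mu = \frac{\sum_i n_i m_i}{\sum_i n_i} = \frac{1}{n} \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij} .$$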




To determine the statistical error of the weighted average, the transformation
law (5.74) for independent variables must be applied to Eq. (10.67), as in the case
of the sample mean:

$$\mathrm{Var}[\hat\mu] = \mathrm{Var}\left[ \frac{\sum_i n_i M_i/\sigma_i^2}{\sum_i n_i/\sigma_i^2} \right] = \left( \frac{1}{\sum_i n_i/\sigma_i^2} \right)^2 \sum_i \frac{n_i^2\, \mathrm{Var}[M_i]}{\sigma_i^4} .$$

Since $\mathrm{Var}[M_i] = \sigma_i^2/n_i$, one obtains:

$$\mathrm{Var}[\hat\mu] = \left( \frac{1}{\sum_i n_i/\sigma_i^2} \right)^2 \sum_i \frac{n_i}{\sigma_i^2} = \frac{1}{\sum_i n_i/\sigma_i^2} . \tag{10.69}$$

The weighted average interval estimation at one standard deviation is then given by:

$$\mu \in \frac{\sum_{i=1}^{k} m_i p_i}{\sum_{i=1}^{k} p_i} \pm \frac{1}{\sqrt{\sum_{i=1}^{k} p_i}} , \qquad p_i = \frac{n_i}{\sigma_i^2} . \tag{10.70}$$

If $\sigma_i = \sigma$, this equation transforms into Eq. (6.50). The confidence levels to be
assigned to the interval are Gaussian (i.e. they follow the $3\sigma$ law), because they
refer to linearly combined Gaussian variables. Also for non-Gaussian variables, the
Central Limit Theorem ensures that normality will be reached for n greater than
about ten. In practice, often the variances are unknown and are estimated from data
by setting $\sigma_i \simeq s_i$. It is possible to show that, also in this case, the confidence
levels to be assigned to the interval (10.70) are Gaussian when n is large. The
proof exploits the consistency of the $s_i$'s as estimators of the $\sigma_i$'s, the Central Limit
Theorem for the convergence in distribution of $M_i$, and the independence between
observations. A necessary hypothesis for the demonstration is that the weight of
each subgroup of observations does not become negligible with respect to the others,
which is defined by requiring that $n_i/n$ tends to a constant for $n \to \infty$.

We can now capitalize on our knowledge of estimator theory by asking whether
the weighted mean is an efficient estimator. Since it is an ML estimator, on the basis
of Theorem 10.4, we deduce that either it is the most efficient estimator or there is no
optimal Cramér-Rao estimator for the weighted sums of data. Applying Eq. (10.27)
to Eq. (10.66), it is immediate to see that the weighted average is the most efficient
estimator, because it satisfies the Cramér-Rao limit of Theorem 10.3:

$$\sum_i n_i I_i(\mu) = \left\langle -\frac{\partial^2 \ln L}{\partial \mu^2} \right\rangle = \left\langle \sum_i \sum_j \frac{\partial}{\partial \mu} \frac{\mu - X_{ij}}{\sigma_i^2} \right\rangle = \sum_i \frac{n_i}{\sigma_i^2} , \tag{10.71}$$

which is just the inverse of the variance of Eq. (10.70) (here $I_i(\mu)$ indicates the Fisher
information for $\mu$ belonging to the distribution of $X_{ij}$).

In R, the weighted.mean(x, w) routine calculates the weighted average of
the data of a vector x with weights w, but we have not found an R code for the error
calculation. We have then implemented this possibility in our MeanEst routine,
already described in Sect. 6.9, with the call MeanEst(x, Sigma=sx), where sx
is the vector of the true or estimated standard deviations of x. If the vector Sigma
is absent, the non-weighted mean is performed.


Exercise 10.8 
Using the computer routine random, 20 variates $0 \le x_i \le 1$ have been
extracted from the uniform density. They are reported in the following table:


0.198  0.530  0.005  0.147
0.898  0.445  0.573  0.943
0.127  0.870  0.859  0.608
0.605  0.729  0.160  0.555
0.202  0.313  0.782  0.112


Compute the mean of the whole sample, the weighted mean of the first 15 and
of the last 5 data (the first three columns and the last column of the table), and
compare the results.


Answer The mean and the standard deviation of the three samples are given,
with obvious notation, by:

$$m_{20} = 0.485 , \quad s_{20} = 0.302$$
$$m_{15} = 0.489 , \quad s_{15} = 0.298$$
$$m_{5} = 0.473 , \quad s_{5} = 0.347 .$$

These data, according to Eq. (3.82), are the variates of a uniform distribution
with $\mu = 0.5$ and $\sigma = 1/\sqrt{12} = 0.289$.
By applying Eq. (6.50) to the three samples, one gets:

$$m_{20} \in 0.485 \pm \frac{0.302}{\sqrt{20}} = 0.485 \pm 0.068 ,$$
$$m_{15} \in 0.489 \pm \frac{0.298}{\sqrt{15}} = 0.489 \pm 0.077 ,$$
$$m_{5} \in 0.473 \pm \frac{0.347}{\sqrt{5}} = 0.47 \pm 0.15 .$$

The weighted mean of the two partial averages, which are independent since
they come from samples without common data, is given by Eq. (10.70):

$$\mu \in \frac{0.489 \cdot 168.7 + 0.473 \cdot 41.6}{168.7 + 41.6} \pm \frac{1}{\sqrt{168.7 + 41.6}} = 0.486 \pm 0.069 ,$$

where $p_{15} = 15/s_{15}^2 = 168.7$ and $p_5 = 5/s_5^2 = 41.6$ are the weights. This
result can be obtained with the code:


> MeanEst (x=c(0.489,0.470) ,Sigma=c(0.077,0.150) ) 


As you can see, the result is practically identical to the total mean $m_{20}$. If we
did not know the weighted average formula, we could have applied a different
estimator, namely, the one given by the arithmetic mean of the two partial
averages with the relative error obtained from Eqs. (5.67) and (5.74):

$$\mu \in \frac{m_{15} + m_5}{2} \pm \frac{1}{2} \sqrt{\frac{s_{15}^2}{15} + \frac{s_5^2}{5}} = \frac{0.489 + 0.473}{2} \pm 0.5 \sqrt{0.077^2 + 0.155^2} = 0.481 \pm 0.086 .$$

Is this estimator acceptable?

If we denote with $M_p$ and $M_a$ the weighted mean and the arithmetic
mean, we can verify that these estimators are not biased. In fact, by applying
Eq. (10.14), we get:

$$\langle M_p \rangle = \frac{\sum_i p_i \langle M_i \rangle}{\sum_i p_i} = \mu , \qquad \langle M_a \rangle = \frac{\sum_i \langle M_i \rangle}{n} = \mu ,$$

where n is here the number of partial averages being combined.

The crucial difference is that $M_p$, being an ML estimator, is the most efficient.
In fact, it can be proved that the statistical error of $M_a$ is about 20% greater
than that of $M_p$. The estimator $M_a$ is therefore not acceptable. That is, $M_a$
correctly estimates an interval that contains the true mean (in a frequentist
sense), but the width of this interval is greater than that of $M_p$.
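
The comparison between the two estimators can be reproduced with a few lines of
base R. This sketch does not use the MeanEst routine: it starts from the rounded
partial means and errors quoted in the exercise (0.489 ± 0.077 and 0.473 ± 0.155),
and the variable names are chosen here for illustration.

```r
m <- c(0.489, 0.473)         # partial means m15 and m5
s <- c(0.077, 0.155)         # their standard errors s_i / sqrt(n_i)
w <- 1 / s^2                 # weights p_i = n_i / s_i^2, Eq. (10.68)

mw  <- weighted.mean(m, w)   # weighted mean, Eq. (10.67)
emw <- 1 / sqrt(sum(w))      # its error, Eq. (10.70)

ma  <- mean(m)               # arithmetic mean of the two partial averages
ema <- 0.5 * sqrt(sum(s^2))  # its error from the propagation law

print(round(c(weighted = mw, err_w = emw, arithmetic = ma, err_a = ema), 3))
```

Both estimates are compatible with the true mean 0.5, but the interval of the
arithmetic mean is wider, as stated in the text.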


10.9 Test of Hypotheses 


In this and in the next sections, we complete the important topic of hypothesis
testing, which we introduced for the first time at a somewhat intuitive level in
Exercises 3.13-3.17 and in Sect. 7.1, without specifying exactly the alternatives
against which the null hypothesis was tested.

After having defined the likelihood function, we are now able to address the
topic with greater precision, considering a null hypothesis against an alternative
hypothesis and introducing an optimality criterion for the choice between the two
hypotheses. However, this subject is much broader, because it is possible to deal
with cases in which both the null and the alternative hypotheses are actually sets of
hypotheses. Here we will limit ourselves to giving the basic ideas, which however will
already allow us to define a series of methods that are applicable in a simple and
direct way to many concrete cases.

From now on we will assume that the likelihood function $L(\theta; x)$ is not simply
proportional but coincides with the considered density function. If the observation
consists of only one measurement, then $L(\theta; x) = p(x; \theta)$, where $p(\cdot)$ is the p.d.f.
of the random variable X. Usually the hypothesis is tested by checking if the value of
an estimator of $\theta$ (often a function of a sufficient statistic for $\theta$) belongs to a “critical
set”. The likelihood function considered will then be the probability
density of the estimator sample distribution.

Let x be an observation, and consider two hypotheses $H_0$ (the main one, called
null hypothesis) against $H_1$, which represents a possible alternative to $H_0$. If their
exclusive effect is to have a different value of the parameters in the density function,
we can construct, with obvious notation, two likelihood functions: $L(\theta_0; x)$ when
$H_0: \theta = \theta_0$ and $L(\theta_1; x)$ when $H_1: \theta = \theta_1$. In order not to burden the notation,
we will consider here only the one-dimensional case, but all the conclusions we will
draw also apply to a vector observation $\boldsymbol{x}$ and/or a set of parameters $\boldsymbol{\theta}$.

For the testing of hypotheses, the situation is that of Fig. 10.6, and the terminology
is that of Tables 10.1 and 10.2, where some new terms appear, besides those
already known.

If $p(x|H_0)$ is the density of the estimator corresponding to the null hypothesis,
$H_0$ is accepted if $t_{\alpha/2} \le x \le t_{1-\alpha/2}$, or is rejected if x is in the critical region,
defined by $x < t_{\alpha/2}$, $x > t_{1-\alpha/2}$. The area subtended by the critical region is
the significance level SL, corresponding to the probability of making a mistake by
rejecting the $H_0$ hypothesis when it is true (type I error). In the case of a one-tailed
test, in which the hypothesis $H_1$ corresponds to a single distribution, to the left (or
to the right) of $p(x|H_0)$, the quantities $t_{\alpha/2}$ and $t_{1-\alpha/2}$ are replaced by only one
quantile $t_\alpha$ (or $t_{1-\alpha}$).

If, on the other hand, $H_0$ is wrong and the right hypothesis is given by the density
$p(x|H_1)$, the tail area, indicated as $\beta$ in Fig. 10.6, corresponds to the probability
of rejecting the correct hypothesis, because in this case the null hypothesis $H_0$
is accepted (type II error). The area $\pi = 1 - \beta$ is called the power of the test
and corresponds to the probability of discarding $H_0$ when $H_1$ is true. The density
$p(x|H_2)$ of Fig. 10.6 indicates the symmetric situation when the maximum of the
density relative to the alternative hypothesis $H_2$ lies to the right of the maximum of
the null hypothesis.

Fig. 10.6 Graphic representation of the quantities involved in the test between two hypotheses.
The quantiles t refer to the $p(x|H_0)$ distribution


Table 10.1 The language of statistical tests

Term                              Meaning
Null hypothesis $H_0$             Reference model
Alternative hypothesis $H_1$      Alternative model
Type I error                      To reject $H_0$ when it is true
Type II error                     To accept $H_0$ when $H_1$ is true
Significance level SL             Probability of type I error
$\beta$ area                      Probability of type II error
Test level $\alpha$               A priori fixed value of SL
Critical or rejection region      Interval ($x < t_{\alpha/2}$ or $x > t_{1-\alpha/2}$) of Fig. 10.6
Power of the test $\pi = 1-\beta$ Probability to reject $H_0$ when $H_1$ is true
More powerful test                For a given $\alpha$, the test with the highest power
One-tailed test                   One tail only, to the left or to the right
Two-tailed test                   Two tails, to the left and to the right


We introduced for the first time the power of the test in Sect. 7.7, as a mean
fraction of null hypotheses correctly discarded when there are no models for the
alternative hypotheses. Here, the power $1 - \beta$ depends on the alternative hypothesis
to be examined.


Table 10.2 Testing between two hypotheses: terminology and corresponding probability
levels

True hypothesis    Decision $H_0$                    Decision $H_1$
$H_0$              Correct decision, $1 - \alpha$    Type I error, $\alpha$
$H_1$              Type II error, $\beta$            Correct decision, $1 - \beta$


If a significance level $\alpha$, related to two tail values $t_{\alpha/2}$ and $t_{1-\alpha/2}$, is fixed a
priori (test level), the null hypothesis $H_0$ is accepted when $x \in [t_{\alpha/2}, t_{1-\alpha/2}]$ with
probability equal to:

$$P\{t_{\alpha/2} \le X \le t_{1-\alpha/2} \,|\, H_0\} = P\{X \in A \,|\, H_0\} = \int_{t_{\alpha/2}}^{t_{1-\alpha/2}} L(\theta_0; x)\, dx = 1 - \alpha , \tag{10.72}$$

where A is the acceptance interval $[t_{\alpha/2}, t_{1-\alpha/2}]$ for the one-dimensional case and a
subset of the spectrum of X in the multidimensional case.

The power of the test is the probability to reject $H_0$ when $H_1$ is true, that is, the
probability to obtain results in the critical region when $H_1$ is true:

$$1 - P\{X \in A \,|\, H_1\} = \int_{-\infty}^{t_{\alpha/2}} L(\theta_1; x)\, dx + \int_{t_{1-\alpha/2}}^{+\infty} L(\theta_1; x)\, dx = 1 - \beta , \tag{10.73}$$

where $\beta$ is the type II error probability, that is, to accept $H_0$ when $H_1$ is true:

$$P\{X \in A \,|\, H_1\} = \int_{t_{\alpha/2}}^{t_{1-\alpha/2}} L(\theta_1; x)\, dx = \beta . \tag{10.74}$$

For an ideal test, where the densities corresponding to $H_0$ and $H_1$ have disjoint
support, $\beta = 0$ and the power $1 - \beta = 1$ is maximum.

These are the definitions related to hypothesis testing in the more general case of
a two-tailed test. For the one-tailed test the rejection region has the form $(-\infty, c)$
or $(c, +\infty)$.
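
Equations (10.72)-(10.74) can be evaluated directly when both hypotheses are
Gaussian. The sketch below assumes two unit-variance Gaussians with means 0 and 3
and a test level of 5%; these numbers are illustrative only and do not come from
the text.

```r
mu0 <- 0; mu1 <- 3; sigma <- 1; alpha <- 0.05    # illustrative values only

# Acceptance interval A = [t_{alpha/2}, t_{1-alpha/2}], quantiles taken under H0
A <- qnorm(c(alpha / 2, 1 - alpha / 2), mean = mu0, sd = sigma)

# Eq. (10.72): probability content of A under H0, equal to 1 - alpha
p_accept_H0 <- diff(pnorm(A, mean = mu0, sd = sigma))

# Eq. (10.74): type II error probability, the content of A under H1
beta <- diff(pnorm(A, mean = mu1, sd = sigma))

# Eq. (10.73): power of the test
power <- 1 - beta

print(c(accept_H0 = p_accept_H0, beta = beta, power = power))
```

Moving mu1 further from mu0 makes the two densities overlap less, so that beta
decreases and the power approaches the ideal value of 1.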


10.10 One- or Two-Sample Tests 


Before examining the optimality criterion for choosing between two hypotheses, let
us familiarize ourselves with the concept of power in the case of the tests on the
mean already seen in Sect. 7.2. Suppose we have a sample mean M calculated from
a Gaussian sample of n events and we want to verify the compatibility with one
of two theoretical means $\mu_0$ (hypothesis $H_0$) and $\mu_1$ (alternative hypothesis $H_1$) of
two populations having the same variance $\sigma^2$. In addition to the test level, we also
want to check its power, choosing n to have an assigned value of $1 - \beta$. Referring
to Fig. 10.6, if we suppose $\mu_1 < \mu_0$, we can write the probabilities:

$$P\left\{ \frac{M - \mu_0}{\sigma/\sqrt{n}} \le -|t_\alpha| \,\Big|\, H_0 \right\} = P\left\{ M \le \mu_0 - |t_\alpha| \frac{\sigma}{\sqrt{n}} \,\Big|\, H_0 \right\} = \alpha . \tag{10.75}$$

This critical region must have probability of $1 - \beta$ under $H_1$, that is,

$$P\left\{ M \le \mu_0 - |t_\alpha| \frac{\sigma}{\sqrt{n}} \,\Big|\, H_1 \right\} = 1 - \beta . \tag{10.76}$$

This equation is valid if and only if:

$$\mu_0 - |t_\alpha| \frac{\sigma}{\sqrt{n}} = \mu_1 + |t_\beta| \frac{\sigma}{\sqrt{n}} , \tag{10.77}$$

so that the required minimum n is given by:

$$n = \left[ \frac{\sigma\,(|t_\alpha| + |t_\beta|)}{\mu_1 - \mu_0} \right]^2 . \tag{10.78}$$

If $\mu_1 > \mu_0$, Eqs. (10.75) and (10.76) become:

$$P\left\{ \frac{M - \mu_0}{\sigma/\sqrt{n}} \ge |t_\alpha| \,\Big|\, H_0 \right\} = P\left\{ M \ge \mu_0 + |t_\alpha| \frac{\sigma}{\sqrt{n}} \,\Big|\, H_0 \right\} = \alpha , \tag{10.79}$$

$$P\left\{ M \ge \mu_0 + |t_\alpha| \frac{\sigma}{\sqrt{n}} \,\Big|\, H_1 \right\} = 1 - \beta , \tag{10.80}$$

giving the condition:

$$\mu_0 + |t_\alpha| \frac{\sigma}{\sqrt{n}} = \mu_1 - |t_\beta| \frac{\sigma}{\sqrt{n}} , \tag{10.81}$$

which again leads to Eq. (10.78).

In the case of a two-tailed test, one proceeds as above, taking both cases $\mu_1 < \mu_0$
and $\mu_0 < \mu_1$ into consideration, but with the level $\alpha/2$ instead of $\alpha$. One then
gets Eq. (10.78) again, with $t_{\alpha/2}$ replacing $t_\alpha$.

If one substitutes $\sigma$ with the sample standard deviation s, the Student quantiles
must be used. Since these quantiles depend on n, a closed-form solution as in the
Gaussian case is no longer possible, and one must iterate over n until the solution
is reached. This calculation is performed by the R routine power.t.test, which
works with five quantities: the difference $\mu_1 - \mu_0$, an assumption on the value of $\sigma$,
and the values of $\alpha$, $1 - \beta$ and n. Giving four of these five values as inputs, the routine
calculates the missing one. An example of use is given by the following exercise.


Exercise 10.9 

It is assumed as a null hypothesis $H_0$ that a variable X is Gaussian with mean
$\mu_0 = 10$ and standard deviation $\sigma = 10$. Find the optimal sample size n and
the critical region to accept $H_0$ with $\alpha = 0.05$ and power $1 - \beta = 0.95$ against
the alternative hypothesis $H_1$ of a Gaussian with the same $\sigma = 10$ and mean
$\mu_1 = 20$. Consider also the test with $\sigma \simeq s$.


Answer We are in the case of the one-tailed test; hence we use the Gaussian
quantile $|t_\alpha| = |t_\beta| = |t_{0.95}| = 1.645$. We then obtain, from Eq. (10.78):

$$n = \left[ \frac{10 \cdot (1.645 + 1.645)}{20 - 10} \right]^2 = 10.8 .$$

The critical value m is obtained from Eq. (10.81):

$$m = 10 + 1.645 \frac{10}{\sqrt{10.8}} = 20 - 1.645 \frac{10}{\sqrt{10.8}} \simeq 15 .$$

The required test must then sample n = 11 values of X, calculate the sample
mean M, accept $H_0$ if {M < 15} and accept $H_1$ if {M ≥ 15}. If we assume
that $\sigma \simeq s = 10$, we can use R with the call:

power.t.test(delta=10,sd=10,sig.level=0.05,power=0.95,alt='one',type='one')

to obtain, as a result, n = 12.3. Therefore:

$$m = 10 + 1.78 \frac{10}{\sqrt{12.3}} = 15.1 ,$$

where 1.78 is the Student quantile $t_{0.95}$ with 12 degrees of freedom. In the call,
delta = $\mu_1 - \mu_0$, sig.level = $\alpha$, alt refers to the one-tailed test and type
to the single sample case. In summary, we obtain $n \simeq 12$, m = 15.
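
The numbers of this exercise can be cross-checked with the short R sketch below,
which evaluates Eqs. (10.78) and (10.81) with exact Gaussian quantiles and then
calls power.t.test with fully spelled-out arguments. Variable names are chosen
here for illustration.

```r
mu0 <- 10; mu1 <- 20; sigma <- 10
alpha <- 0.05; power <- 0.95

t_alpha <- qnorm(1 - alpha)    # |t_alpha|
t_beta  <- qnorm(power)        # |t_beta|

# Gaussian case: minimum sample size, Eq. (10.78)
n_gauss <- (sigma * (t_alpha + t_beta) / (mu1 - mu0))^2
# Critical value of the sample mean, Eq. (10.81)
m_crit  <- mu0 + t_alpha * sigma / sqrt(n_gauss)

# Student case: power.t.test solves for n, the one argument left unspecified
pt <- power.t.test(delta = mu1 - mu0, sd = sigma, sig.level = alpha,
                   power = power, alternative = "one.sided",
                   type = "one.sample")
n_student <- pt$n

print(c(n_gauss = n_gauss, m_crit = m_crit, n_student = n_student))
```

Since here $\alpha = \beta$, the critical value falls exactly halfway between the two
means, whatever the sample size.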


In the two-sample problem, the question is whether they come from populations
with the same mean. In this case, it is necessary to find the smallest common
dimension n of these two samples in order to distinguish between the two means,
with the predefined test levels. To solve this problem it is enough to replace M by
$M_1 - M_0$, $\mu_0$ by zero, and $\mu_1$ by $\mu_1 - \mu_0$ in Eqs. (10.75)-(10.78). Moreover, the
uncertainty on the difference between the two means, $\sigma\sqrt{1/n + 1/n} = \sqrt{2}\,\sigma/\sqrt{n}$,
must substitute the error $\sigma/\sqrt{n}$ in the denominator. The result 2n is immediately
obtained for the number of events, where n is from Eq. (10.78). In the two-sample
case, the rule is therefore to double the result of the one-sample case. If $\sigma \simeq s$,
the Student's density calculation gives a value slightly smaller than the one-sample
doubled value. If we reconsider Exercise 10.9, in the Gaussian case, we obtain
n = 21.6 and m = 15, whereas with the Student's density, with the call to
power.t.test with type='two', we obtain n = 22.4 and m = 15.1 for the
Student quantile $t_{0.95} = 1.71$ with 23 degrees of freedom.

The same type of test is often used with frequencies. Using the Gaussian
approximation, if $p_0$ and $p_1$ are the true probabilities under $H_0$ and $H_1$, respectively,
and a frequency f is measured, from Eq. (3.6) and when $p_1 < p_0$, Eq. (10.77)
becomes:

$$p_0 - |t_\alpha| \sqrt{\frac{p_0 (1 - p_0)}{n}} = p_1 + |t_\beta| \sqrt{\frac{p_1 (1 - p_1)}{n}} , \tag{10.82}$$

whereas, when $p_1 > p_0$, it is necessary to exchange the signs on both sides of the
equation.

Since we are using the Gaussian approximation, these formulae are typically used
for n > 5. The minimum n value is obtained by repeating exactly the procedure that
led to Eq. (10.78):

$$n = \left[ \frac{|t_\alpha| \sqrt{p_0 (1 - p_0)} + |t_\beta| \sqrt{p_1 (1 - p_1)}}{|p_1 - p_0|} \right]^2 . \tag{10.83}$$

In the two-sample case, the procedure for the left-tailed test is based on the
distribution of the difference between the measured frequencies f_1 − f_0, similar to
that just seen for the two Gaussian samples, except that now the variance of f_1 − f_0
depends on the values assumed for p_1 and p_0. Under H_0, when p_1 = p_0 = p,
the variance of f_1 − f_0 is 2p(1 − p)/n and usually p = (p_1 + p_0)/2; under H_1,
when p_1 < p_0, the variance of f_1 − f_0 is [p_1(1 − p_1) + p_0(1 − p_0)]/n. Under the
Gaussian approximation, we use Eq. (10.82), replace the standard deviations on the
left- and right-hand sides with those just calculated and substitute the first p_0 on the
left-hand side with zero and the first p_1 on the right-hand side with (p_1 − p_0). In the
end, we get the following result:

$$ n = \left[ \frac{ |t_\alpha| \sqrt{2p(1 - p)} + |t_\beta| \sqrt{p_0 (1 - p_0) + p_1 (1 - p_1)} }{ |p_1 - p_0| } \right]^2 , \qquad p = \frac{p_0 + p_1}{2} , \tag{10.84} $$

also valid for the one-tailed right test. We recall again that for the two-tailed test, the
quantile t_α must be replaced with t_{α/2} in Eqs. (10.83) and (10.84).

In R, the test on two binomial samples (proportions) is done by the
power.prop.test routine, which takes as input the values p_0, p_1, α, 1 − β, n
and gives in output the value that is not entered as input. This routine does not
consider the one-sample case of Eq. (10.83).




Finally, we recall that the R library pwr includes many routines that calculate 
various one- and two-sample cases for the currently used statistical variables, also 
including the cases of samples with different size. 


Exercise 10.10 

A null hypothesis H_0 with p_0 = 0.01 is assumed. Find the optimal sample
size and the critical region to accept H_0 with α = 0.05 and power 1 − β = 0.80
against the alternative hypothesis H_1 that p_1 = 0.02. Carry out the one- and
two-sample tests.


Answer Using Eq. (10.83) for the one-tailed test, one obtains:

$$ n = \left[ \frac{1.645 \sqrt{0.01 \cdot 0.99} + 0.842 \sqrt{0.02 \cdot 0.98}}{0.01} \right]^2 \simeq 793 . $$

From Eq. (10.82) the limit of the critical region is given by f = 0.01 +
1.645 √(0.01 · 0.99/793) = 0.02 − 0.842 √(0.02 · 0.98/793) = 0.0158,
corresponding to a number of successes x = 0.0158 · 793 = 12.5 ≈ 13. If x < 13,
H_0 is accepted; otherwise H_1 is chosen.

For the two-sample case, from Eq. (10.84), one obtains n = 1845. The call
power.prop.test(p1=0.01, p2=0.02, sig.level=0.05, power=0.8, alt='one')
provides the value n = 1826, corresponding to f = 0.0138 and x =
25.2 ≈ 25. The R routine contains continuity corrections that are absent in
Eq. (10.84).
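
As a cross-check of the numbers in this exercise, the following minimal R sketch evaluates Eq. (10.83) and the critical frequency of Eq. (10.82), and repeats the power.prop.test call with the arguments spelled out in full. The helper name n_one_sample is introduced here only for illustration; since qnorm is used instead of the rounded quantiles 1.645 and 0.842, the unrounded n may differ slightly from the hand calculation.

```r
# Sample size for the one-sample, one-tailed test on a proportion, Eq. (10.83).
# n_one_sample is a hypothetical helper name, not a routine of the book.
n_one_sample <- function(p0, p1, alpha, power) {
  t_a <- qnorm(1 - alpha)   # |t_alpha|
  t_b <- qnorm(power)       # |t_beta|
  ((t_a * sqrt(p0 * (1 - p0)) + t_b * sqrt(p1 * (1 - p1))) / abs(p1 - p0))^2
}

n1 <- n_one_sample(p0 = 0.01, p1 = 0.02, alpha = 0.05, power = 0.80)
n1_int <- ceiling(n1)       # smallest integer sample size

# Limit of the critical region from Eq. (10.82), signs exchanged since p1 > p0
f_crit <- 0.01 + qnorm(0.95) * sqrt(0.01 * 0.99 / n1_int)
x_crit <- f_crit * n1_int   # corresponding number of successes

# Two-sample case with the standard R routine (stats package)
two <- power.prop.test(p1 = 0.01, p2 = 0.02, sig.level = 0.05,
                       power = 0.80, alternative = "one.sided")

print(c(n_one_sample = n1_int, f_crit = f_crit, x_crit = x_crit,
        n_two_sample = two$n))
```

The value two$n is the size of each of the two samples, so the total number of events in the two-sample test is twice that figure.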


10.11 Most Powerful Tests 


Suppose you are testing a filter for the identification of spam emails. This filter
analyses emails and assigns a score. After the filter has analysed samples of good
emails and of spam, two histograms of the scores similar to distributions of the
type shown in Fig. 10.6 are obtained, where on the x axis, there are the evaluated
scores and on the y axis the number of emails with that given score. If the score is
higher for spam, we will have two distributions like p(x|H_0) for good emails and
p(x|H_1) for spam. If the two distributions are disjoint, selecting good emails would
be trivial. If instead the two distributions partially overlap, as in Fig. 10.6, we should
devise a selection criterion, that is to say, evaluate the score above which an email
is discarded. The first parameter to take into consideration is the type I error α: in
this case we have to fix it as small as possible, to avoid discarding good emails. Then,
we will set an upper score limit, allowing some spam to enter the system. At
this point the β error determines the percentage of spam accepted and the power
1 − β the percentage of spam correctly rejected. In other types of problems, as in
production quality controls, where it is more important to avoid a “dirty” sample
than to miss good events, a larger value of α must be chosen to decrease that of β.
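
The trade-off between α and β described above can be made concrete with a small R sketch. The two score distributions are assumed to be Gaussian, and their means and standard deviations (mu0, sd0, mu1, sd1) are purely hypothetical values chosen for illustration; they do not come from the text or from Fig. 10.6.

```r
# Hypothetical score distributions (assumed values, for illustration only)
mu0 <- 30; sd0 <- 10   # good emails, p(x|H0)
mu1 <- 60; sd1 <- 12   # spam,        p(x|H1)

alpha <- 0.01          # fraction of good emails we accept to discard

# Score above which an email is discarded: upper alpha quantile under H0
cut <- qnorm(1 - alpha, mean = mu0, sd = sd0)

# Type II error: spam scoring below the cut, hence accepted
beta  <- pnorm(cut, mean = mu1, sd = sd1)
power <- 1 - beta      # fraction of spam correctly rejected

print(c(cut = cut, beta = beta, power = power))

# A larger alpha lowers the cut and therefore decreases beta
beta_large_alpha <- pnorm(qnorm(1 - 0.10, mu0, sd0), mu1, sd1)
```

With these assumed distributions, moving α from 1% to 10% lowers the score limit and reduces the fraction of spam entering the system, which is the behaviour described for quality controls.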

This example clearly shows the standard procedure used to perform statistical
tests. At first, the level of significance α appropriate for the problem is fixed a priori;
then an estimator T_n = t_n(X) is chosen (the mean, the variance, the chi-square, etc.),
the sample distribution of the estimator is found, and its quantile values t_α or t_{1−α}
are determined. If α remains fixed and the estimator is changed, the power of the
test associated to the alternative hypothesis H_1 turns out to depend on the chosen
estimator. Now let us ask ourselves the fundamental question: given an observation
of a variable X, which is the most powerful test, that is, the one that has the best
critical region?

The question may in some cases have considerable practical interest, because
statistical tests often have a high economic and/or management cost (as an example,
think of quality control tests in industry). It is therefore important to choose the most
powerful test, for which, at a fixed level α, the type II error is minimal and,
consequently, the decision criterion is the most reliable. An answer to this issue is
given by the Neyman-Pearson (NP) theorem:


Theorem 10.6 (Neyman-Pearson) Let θ be a parameter of a likelihood function
and consider the null hypothesis H_0 and the alternative one H_1:

$$ H_0 : \theta = \theta_0 , \qquad H_1 : \theta = \theta_1 . $$

The likelihood ratio:

$$ R(X) = \frac{L(\theta_0; X)}{L(\theta_1; X)} , \tag{10.85} $$

takes large values if H_0 is true and small values if H_1 is true. The most powerful test
among all those of level α is given by:

$$ \text{reject } H_0 \text{ if } \left\{ R(X) = \frac{L(\theta_0; X)}{L(\theta_1; X)} \le r_\alpha \right\} , \tag{10.86} $$

where r_α is the R(X) value of Fig. 10.7, such that P{R(X) ≤ r_α | H_0} = α.


Proof Let Ω be the H_0 rejection region (of level α) for the NP test and Ω′ be the
rejection region (of level α′) for any other test. The theorem holds if it is proved that
the probability of the type II error for the same α level is minimal for the NP test:

$$ \alpha = \alpha' \implies \beta' \ge \beta . \tag{10.87} $$

By assumption, one has:

$$ \alpha = \int_{\Omega} L(\theta_0; x)\, dx = \alpha' = \int_{\Omega'} L(\theta_0; x)\, dx . $$

Fig. 10.7 Using the likelihood ratio R, the null hypothesis is rejected if the value of the ratio is
lower than the limit r_α, which determines the significance level given by the shaded area

If β′ ≠ β one can write (see Fig. 10.6):

$$ \beta' - \beta \equiv \Delta\beta = 1 - \int_{\Omega'} L(\theta_1; x)\, dx - \left[ 1 - \int_{\Omega} L(\theta_1; x)\, dx \right] = \int_{\Omega} L(\theta_1; x)\, dx - \int_{\Omega'} L(\theta_1; x)\, dx . \tag{10.88} $$

If x ∈ Ω, Eq. (10.86) is always true:

$$ \frac{L(\theta_0; x)}{L(\theta_1; x)} \le r_\alpha \implies L(\theta_1; x) \ge \frac{1}{r_\alpha} L(\theta_0; x) , \qquad x \in \Omega , $$

whereas, if x ∈ (Ω′ − Ω):

$$ \frac{L(\theta_0; x)}{L(\theta_1; x)} > r_\alpha \implies L(\theta_1; x) < \frac{1}{r_\alpha} L(\theta_0; x) , \qquad x \notin \Omega . $$

By replacing these last two relations in Eq. (10.88) and taking into account that the
integrations on Ω ∩ Ω′ do not contribute, Eq. (10.87) is obtained:

$$ \Delta\beta \ge \frac{1}{r_\alpha} \left[ \int_{\Omega} L(\theta_0; x)\, dx - \int_{\Omega'} L(\theta_0; x)\, dx \right] = \frac{1}{r_\alpha} (\alpha - \alpha') = 0 . $$

This proves the theorem. □




We now apply this theorem to test the hypothesis of two different means H_0 : μ = μ_0
against H_1 : μ = μ_1, with samples extracted from Gaussian populations of equal
variance σ². Under these conditions, the ratio (10.85) becomes:

$$ \frac{L_0}{L_1} = \exp\left\{ -\frac{1}{2\sigma^2} \left[ \sum_i (x_i - \mu_0)^2 - \sum_i (x_i - \mu_1)^2 \right] \right\} \le r_\alpha , $$

and hence, using logarithms:

$$ 2(\mu_0 - \mu_1) \sum_i x_i \le 2\sigma^2 \ln r_\alpha + n(\mu_0^2 - \mu_1^2) . \tag{10.89} $$

Dividing by 2n(μ_0 − μ_1), if (μ_0 − μ_1) > 0 one obtains:

$$ m_n = \frac{1}{n} \sum_i x_i \le \frac{\sigma^2}{n(\mu_0 - \mu_1)} \ln r_\alpha + \frac{\mu_0 + \mu_1}{2} \equiv m_0 . \tag{10.90} $$

In this case the Neyman-Pearson test with the explicit calculation of r_α is equivalent
to the test on the sample mean: H_0 is discarded if {M_n < m_0}. The value m_0 is
determined by the type I error under H_0:

$$ \alpha = P\{M_n < m_0\} = P\left\{ \frac{M_n - \mu_0}{\sigma/\sqrt{n}} < \frac{m_0 - \mu_0}{\sigma/\sqrt{n}} \right\} . $$

This condition holds when m_0 satisfies the equality:

$$ \frac{m_0 - \mu_0}{\sigma/\sqrt{n}} = t_\alpha \implies m_0 = \mu_0 + \frac{\sigma}{\sqrt{n}}\, t_\alpha , $$

where t_α is the α quantile of the standard Gaussian. If μ_0 < μ_1, it is necessary
to change the sign of the inequality (10.89), and H_0 is discarded if {M_n > m_0 =
μ_0 + (σ/√n) t_{1−α}}.
It is also easy to verify that the test on the sample mean is the most powerful even
for exponential and Poisson distributions.
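
The equivalence between the likelihood ratio test and the test on the sample mean can be checked numerically. The R sketch below is only an illustration under assumed values (mu0 = 10, mu1 = 9, sigma = 2, n = 25, which are not taken from the text): it computes m_0 from the α quantile, obtains r_α by inverting Eq. (10.90), and verifies on simulated samples under H_0 that the two rejection rules select the same samples.

```r
# Assumed illustrative values (not from the text), with mu0 > mu1
mu0 <- 10; mu1 <- 9; sigma <- 2; n <- 25; alpha <- 0.05

# Critical value of the sample mean: m0 = mu0 + (sigma/sqrt(n)) * t_alpha
m0 <- mu0 + sigma / sqrt(n) * qnorm(alpha)

# Corresponding limit on ln R, obtained by inverting Eq. (10.90)
log_r_alpha <- n * (mu0 - mu1) / sigma^2 * (m0 - (mu0 + mu1) / 2)

# Power against H1: P{M_n < m0 | mu = mu1}
power <- pnorm((m0 - mu1) / (sigma / sqrt(n)))

# Simulate samples under H0 and compare the two rejection rules
set.seed(1)
x <- matrix(rnorm(n * 20000, mean = mu0, sd = sigma), ncol = n)
log_R <- rowSums(dnorm(x, mu0, sigma, log = TRUE)) -
         rowSums(dnorm(x, mu1, sigma, log = TRUE))   # ln(L0/L1) per sample
M <- rowMeans(x)

same_decision <- all((log_R < log_r_alpha) == (M < m0))
level_sim <- mean(M < m0)   # simulated type I error, to compare with alpha

print(c(m0 = m0, power = power, level_sim = level_sim))
print(same_decision)
```

The logarithm of the ratio is used instead of R itself to avoid numerical underflow when the sample size grows; the decision is unchanged because the logarithm is monotonic.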


10.12 Test Functions 


Neyman-Pearson's test requires the knowledge of the density p(r) of the likelihood
ratio R. In principle, it is always possible to get this by using the techniques
developed in Chap. 5 for functions of random variables, but sometimes these
calculations turn out to be difficult and laborious.

Fig. 10.8 When the likelihood ratio can be expressed as a function of another statistic, the interval
R < r_α relative to R can correspond to an interval T < t_α (a), T > t_α (b), t_1 < T < t_2 (c),
T < t_1, t_2 < T < t_3 (d), depending on the function R = ψ(T)

Fortunately, as we just saw in Eq. (10.90), the problem is simplified drastically if
R can be written under the form:

$$ R = \psi(T) , \tag{10.91} $$

where T is a statistic with a known distribution.

In fact, a test using the statistic T determines the limits t_α corresponding to the
chosen significance level; using Eq. (10.91), one can therefore determine the limit
r_α and apply Theorem 10.6. However, this does not necessarily have to be done in
practice: if Eq. (10.91) holds, the R(X) test can be replaced by a T(X) test, which
also satisfies the maximum power property. The functional relation between R and
T given by Eq. (10.91) can turn the one-tailed test R(x) ≤ r_α (see Eq. (10.86))
into a one-tailed test to the right or the left, a two-tailed test, or a test with a more
complicated critical region, as shown in Fig. 10.8. For example, from Eq. (10.90)
the quantity R is a function of the sample mean. It follows that the mean estimator
T = Σ_i X_i/n is the most powerful test function of the mean for samples extracted
from Gaussian populations. This function maximizes 1 − β once the α test level has
been chosen.




Exercise 10.11 

A company wins a tender for the supply of an electronic component stating
that the percentage of defective parts at the origin is less than 1%. Find the
limit below which the number of defective pieces found in a control batch
of 1000 pieces must remain to be confident, at a level α = 5%, that the
firm's statement is correct. Also find the power of this test with respect to
the alternative hypotheses of a defect probability of 2% and 3%, and verify whether
the test has maximum power.


Answer If the defect rate is 1%, the number of discards on a batch of 1000
pieces is a binomial variable with an expected value (true average) equal to 10.
Since the probability is small (0.01) and the mean value is large, it is possible
to approximate the binomial using a Poisson distribution. We are brought back
to the case of Exercise 10.10, with the difference that here n is fixed and β
unknown. Since the problem requires a significance level of 5%, from Table
E.1 and from Eqs. (3.43) and (3.44), we see that the one-tailed test:

$$ P\{T \ge t_{1-\alpha}\} = 1 - \Phi(t_{1-\alpha}) = 0.05 , $$

is satisfied for a value of the standard variable T = t_{1−α} ≈ 1.645. If S is the
number of discards, from Eq. (3.37) we have:

$$ \frac{S - \mu}{\sigma} = \frac{S - 10}{\sqrt{10}} \ge 1.645 , \quad \text{and hence:} \quad S \ge 15.2 . $$

Since the considered variable is discrete, there is no critical value that
coincides with the assigned SL value. At this point we could randomize the
test by setting in Eq. (7.4) α = 0.05 and:

$$ SL_1 = P\{S \ge 15\} = 1 - \Phi\left( \frac{15 - 10}{\sqrt{10}} \right) = 0.057 , $$

$$ SL_2 = P\{S \ge 16\} = 1 - \Phi\left( \frac{16 - 10}{\sqrt{10}} \right) = 0.029 , $$

where Table E.1 has been used. As shown in Problem 10.15, the randomized
test accepts the batch one time over four when S = 16. Here we prefer to
proceed in a simpler (even if approximate) way, adopting as a decision rule the
acceptance of the batch if S ≤ 15 and its rejection if more than 15 defective
pieces are found.

The power of the test is given by the probability 1 − β to discard H_0 (1%
defect rate) when the alternative hypothesis H_1 is true. We recall again that
β is the area of Fig. 10.6, which means the probability of the type II error. If
H_1 refers to a defect rate of 2%, from Eq. (3.43), and using R, we
obtain:

$$ 1 - \beta = 1 - P\{S < 16; H_1\} = P\{S \ge 16; H_1\} = 1 - \Phi\left( \frac{16 - 20}{\sqrt{20}} \right) \simeq 0.81 , \tag{10.92} $$

where the last value is obtained with 1 - pnorm((16 - 20)/sqrt(20)).

The power of the test is about 80%, and the probability to accept H_0 when H_1
is true is about 20%.
The same calculation, when H_1 refers to a 3% defect rate, gives:

$$ 1 - \beta = 1 - \Phi\left( \frac{16 - 30}{\sqrt{30}} \right) \simeq 0.995 . \tag{10.93} $$


Therefore, the hypotheses H_0 : 1% and H_1 : 3% give distributions having
basically a disjoint support. In practice, the type II error occurs with a
nonnegligible probability only for the alternative hypothesis of a defect rate
of about 2%.

Due to Eq. (10.90), this is the most powerful test. We also verify the results
by considering the Poisson distribution (3.14) to be the likelihood function
approximating the binomial distribution. When the alternative hypothesis H_1
assumes a defect rate of 2%, we write the likelihood ratio (10.85) as:

$$ R = \frac{10^S\, e^{-10}}{20^S\, e^{-20}} = e^{10} \left( \frac{1}{2} \right)^S \equiv \psi(S) , \tag{10.94} $$

which shows that Eq. (10.91) is valid. Since the exponential factor is constant,
the link between R and S decreases monotonically, as in Fig. 10.8b. The
inverse function S = ψ^{-1}(R) is:

$$ S = \frac{10 - \ln R}{\ln 2} . \tag{10.95} $$

We then determine an interval S ≥ 15.5 (intermediate value between the
discrete limit values found on S) and an interval on R approximately equal to

$$ R \le e^{10} (0.5)^{15.5} \simeq 0.47 $$

to reject the null hypothesis at the chosen test level.

The test has then the maximum power among all possible tests at the 5%
level.
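
The numerical steps of this exercise can be reproduced with a few R instructions. This is a minimal sketch that follows the Gaussian approximation used in the answer; the exact binomial values computed with pbinom are added here only as a comparison and are not part of the original solution.

```r
mu0 <- 1000 * 0.01                       # expected defects under H0

# Critical value from (S - 10)/sqrt(10) >= t_{0.95}: about 15.2
s_crit <- mu0 + qnorm(0.95) * sqrt(mu0)

# Gaussian significance levels for the discrete limits S >= 15 and S >= 16
sl <- 1 - pnorm((c(15, 16) - mu0) / sqrt(mu0))

# Power of the rule "reject if S >= 16" for defect rates of 2% and 3%,
# Eqs. (10.92) and (10.93)
mu1 <- c(20, 30)
power_gauss <- 1 - pnorm((16 - mu1) / sqrt(mu1))

# Limit on the likelihood ratio of Eq. (10.94) at S = 15.5
r_limit <- exp(10) * 0.5^15.5

# Exact binomial level and power of the same rule, for comparison only
level_exact <- pbinom(15, size = 1000, prob = 0.01, lower.tail = FALSE)
power_exact <- pbinom(15, size = 1000, prob = c(0.02, 0.03), lower.tail = FALSE)

print(round(c(s_crit = s_crit, SL_15 = sl[1], SL_16 = sl[2]), 3))
print(round(rbind(gauss = power_gauss, exact = power_exact), 3))
print(r_limit)
```

Comparing the two rows of the power table gives a direct feeling for the quality of the Gaussian approximation at these small defect rates.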




Up to now we have considered alternative hypotheses of the type H_1 : θ = θ_1,
called punctual or simple. Let us now consider the case of alternative hypotheses
called composite, which include a set of values, such as H_1 : θ > θ_1. In these cases
we need to explore the power of the test for the whole set of parameter values of the
alternative hypothesis, in order to find the uniformly most powerful test. In these
cases an extension of Eq. (10.85) is usually used, given by the generalized likelihood
ratio:

$$ R = \frac{L(\theta_0; X)}{\max_\theta L(\theta; X)} . \tag{10.96} $$

This equation represents the ratio between the likelihood calculated at the parameter
value θ_0 of the null hypothesis and the value of L(θ; X) calculated at its maximum.
We do not address this rather complex topic, but we limit ourselves to observing that
the generalized likelihood ratio test usually performs well in this context, because
likelihood is always a function of a sufficient statistic of the problem.

In all cases (actually not many) where the power can be explicitly calculated,
one can still determine for which alternative hypotheses a chosen test is sufficiently
powerful by examining the power function, defined as a function of the parameters
of the alternative hypothesis as:

$$ \pi(\theta) = 1 - \beta(\theta) = 1 - P\{X \in A; H_1\} , \tag{10.97} $$

where A is the acceptance region of H_0, that is, the complementary set of the critical
region.


Exercise 10.12 
Find the power function for the case of Exercise 10.11. 


Answer The decision criterion obtained consisted in the rejection of the
hypothesis H_0 for a number of defects S ≥ 16 on a batch of 1000 elements.
Using the Gauss approximation (10.92) of the binomial density, the power
function (10.97) is:

$$ \pi(\mu) = 1 - \Phi\left( \frac{16 - \mu}{\sqrt{\mu}} \right) . \tag{10.98} $$

This curve can be obtained with the R instructions:

> mu <- seq(0,50,by=0.2)
> pow <- 1-pnorm((16-mu)/sqrt(mu))
> plot(mu,pow,type='l')
> grid()

where mu is the expected value of the number of defective pieces. The
function is shown in Fig. 10.9. It shows that the support of the sample
distribution under H_0 (defect rate of 1%) becomes practically disjoint from
the one under H_1 when the alternative percentage of defects is larger than
3%. We arrived more intuitively at the same conclusion in Exercise 10.11.

Fig. 10.9 Power function 1 − Φ((16 − μ)/√μ) for the case of Exercise 10.11, as a function of the
number of defects μ


10.13 Sequential Tests 


In addition to the transformation (10.91), the difficulty in finding the distribution of
the Neyman-Pearson variable R can also be overcome in another very elegant, but
approximate, way.

Consider two hypotheses H_0 and H_1 with the corresponding likelihood functions
L(θ_0; x) ≡ L_0(x) and L(θ_1; x) ≡ L_1(x). The ratio:

$$ R = R(X) = \frac{L_0(X)}{L_1(X)} \equiv \frac{L_0}{L_1} \tag{10.99} $$

tends to assume large values if H_0 is true, small values if H_1 is true. We can then
define the test of the two hypotheses as follows:

$$ \begin{array}{lll}
\text{accept } H_0 & \text{if } (L_0/L_1) \ge r_H & \implies L_0 \ge r_H L_1 \\
\text{accept } H_1 & \text{if } (L_0/L_1) \le r_L & \implies L_0 \le r_L L_1 \\
\text{no decision} & \text{if } r_L < (L_0/L_1) < r_H &
\end{array} \tag{10.100} $$

The areas corresponding to the probabilities of type I and type II errors α and β are
shown in Fig. 10.10.

For a given observation of dimension N, the spectrum of the variable R is then
divided into three disjoint regions, R_0, R_1 and a third region, corresponding
respectively to the decision to accept H_0, to accept H_1 and to not make any decision.

Fig. 10.10 In hypothesis testing, when using the likelihood ratio R, H_0 is accepted if R > r_H; H_1
is accepted if R < r_L; no decision is taken if r_L < R < r_H

The interval R_0 then corresponds to the condition R ≥ r_H, whereas R_1 is the
interval R ≤ r_L. From Fig. 10.10 and Eqs. (10.100), it is clear that the probabilities
of taking the correct decisions are given by:

$$ 1 - \alpha = \int_{R_0} L_0(x)\, dx = \int_{R_0} \frac{L_0(x)}{L_1(x)}\, L_1(x)\, dx \ge r_H \int_{R_0} L_1(x)\, dx = r_H\, \beta , \tag{10.101} $$

$$ 1 - \beta = \int_{R_1} L_1(x)\, dx = \int_{R_1} \frac{L_1(x)}{L_0(x)}\, L_0(x)\, dx \ge \frac{1}{r_L} \int_{R_1} L_0(x)\, dx = \frac{\alpha}{r_L} , \tag{10.102} $$

where, as usual, α and β are the type I and type II error probabilities. From these
inequalities, the lower limit of r_L and the upper limit of r_H can be immediately
obtained as:

$$ r_L \ge \frac{\alpha}{1 - \beta} , \qquad r_H \le \frac{1 - \alpha}{\beta} . \tag{10.103} $$

In conclusion, Eq. (10.100) can be described as follows: in a test of H_0 against H_1,
when α and β are fixed a priori, if:

$$ R \le \frac{\alpha}{1 - \beta} \le r_L \qquad \text{accept } H_1 , \tag{10.104} $$

$$ \frac{\alpha}{1 - \beta} < R < \frac{1 - \alpha}{\beta} \qquad \text{no decision is taken} , \tag{10.105} $$

$$ R \ge \frac{1 - \alpha}{\beta} \ge r_H \qquad \text{accept } H_0 . \tag{10.106} $$

This test is approximate but is independent of the knowledge of the probability
distribution of the likelihood ratio (10.85).

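Which type I and II errors α′ and β′ are actually associated with the test
resulting from Eqs. (10.104)–(10.106)? Indicating the new threshold values of the
approximate test with r′_H and r′_L, a partial answer to this question can be given
because one always has α′ + β′ ≤ α + β. Indeed, from Eqs. (10.101)–(10.103) one
easily obtains:

$$ 1 - \alpha' \ge \beta' r'_H = \beta'\, \frac{1 - \alpha}{\beta} , \qquad 1 - \beta' \ge \frac{\alpha'}{r'_L} = \alpha'\, \frac{1 - \beta}{\alpha} , $$

so that:

$$ \beta' (1 - \alpha) \le \beta (1 - \alpha') , \qquad \alpha' (1 - \beta) \le \alpha (1 - \beta') , $$

and hence:

$$ \alpha' (1 - \beta) + \beta' (1 - \alpha) \le \alpha (1 - \beta') + \beta (1 - \alpha') \implies \alpha' + \beta' \le \alpha + \beta . \tag{10.107} $$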


Exercise 10.13 
Perform the approximate likelihood ratio test on data from Exercise 10.11,
assuming a type I error of 5% and type II error of 20%.


Answer In Exercise 10.11 we studied the probability function of the sum
variable S, connected to the Neyman-Pearson variable R by Eq. (10.95). We
found that the hypothesis H_0 of a 1% defect rate was discarded, at a level of
significance α of 5%, for a number of defective pieces greater than 15 on a
batch of 1000 pieces. We had also found a probability β of about 20% for the
type II error when the hypothesis H_1 has a defect rate of 2%.

Also in this case the levels α = 0.05 and β = 0.20 are requested. From
Eqs. (10.104)–(10.106) one obtains:

accept H_0   if R ≥ 4.75
no decision  if 0.0625 < R < 4.75
accept H_1   if R ≤ 0.0625

From Eq. (10.95), the values of S corresponding respectively to
r_H = 4.75 and r_L = 0.0625 are:

S_H = 12.2 ,   S_L = 18.4 .

The hypothesis (H_0) of a defect rate less than 1% should be accepted if
the number of defective pieces remains below 12, whereas when this value
exceeds 18, the hypothesis H_1 of a defect rate greater than 2% must be
preferred. Values between 12 and 18 relate to an estimated defect rate between
1% and 2% and therefore represent an area of uncertainty where it is advisable
not to make any decision.
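
The thresholds of this exercise follow directly from Eqs. (10.104)–(10.106) and (10.95), as the short R sketch below shows. The function name s_of_r is a hypothetical helper introduced only for this illustration.

```r
alpha <- 0.05; beta <- 0.20

r_L <- alpha / (1 - beta)        # below this value of R, accept H1
r_H <- (1 - alpha) / beta        # above this value of R, accept H0

# Inverse relation S = (10 - ln R)/ln 2 of Eq. (10.95), valid for the
# batch of 1000 pieces with defect rates of 1% (H0) and 2% (H1)
s_of_r <- function(r) (10 - log(r)) / log(2)

s_H <- s_of_r(r_H)               # number of defects corresponding to r_H
s_L <- s_of_r(r_L)               # number of defects corresponding to r_L

print(c(r_L = r_L, r_H = r_H, S_H = s_H, S_L = s_L))
```

Since R decreases when S increases, the upper threshold on R maps onto the lower limit on the number of defects, and vice versa.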


The fact that the limits in Eqs. (10.104)–(10.106) do not depend on the sample
size N suggests applying the likelihood ratio method iteratively, as N increases,
during an experiment. The test is continued as long as the results remain in the
zone of indecision. More precisely, the sample size is increased if:

$$ r_L < R_N < r_H , \qquad N \ge 1 , \tag{10.108} $$

and the test is stopped as soon as Eq. (10.108) is not satisfied. The limits of the
indecision interval are to be specified according to the level of the test and the type
II error.

Notice that both the sample size N and the final likelihood ratio value

$$ R_N = \frac{L_0(\theta_0; X_N)}{L_1(\theta_1; X_N)} \tag{10.109} $$

have to be considered as random variables when the indecision zone is left. These
tests are called sequential.

If we assume that we have determined r_L and r_H so that the probabilities of
type I and II errors are α and β, Eq. (10.103) is valid as well. Hence, in practice, a
sequential test of approximate level will have as an area of indecision:

$$ \frac{\alpha}{1 - \beta} < R_N < \frac{1 - \alpha}{\beta} , \qquad N \ge 1 . \tag{10.110} $$

Also in this case, Eq. (10.107) determines the actual levels of the test: α′ + β′ ≤
α + β.

Although the sequential test is approximate, it requires on average a smaller
sample size compared to tests with fixed N to reach the same power, thus giving
a considerable saving in sampling time, even of 50%. Intuitively, this happens
because there are often sequences of favourable (or unfavourable) cases that quickly
give values of R_N outside the indecision interval (10.110), making it possible to
choose the correct hypothesis (H_0 or H_1) immediately.

We omit here the formal demonstration of this very useful property; it can be
found, under very general conditions, in [MGB73]. Instead, in the next exercise, we
will present a different approach based on simulation techniques.


Exercise 10.14 
Apply the sequential test with variable N to the data of Exercise 10.11,
assuming α = 0.05, β = 0.20 and a percentage of defective pieces of 1%
(H_0) against the alternative hypothesis of 2% (H_1). Estimate, with simulation
methods, the distributions of the number of defective pieces S_N and of the
sampled pieces N.


Answer If N is variable, Eq. (10.94) can be written as:

$$ R_N = \left( \frac{1}{2} \right)^{S_N} e^{N/100} , $$

where S_N is the number of defective pieces. Passing to logarithms one has:

$$ \ln R_N = -S_N \ln 2 + \frac{N}{100} = -\frac{S_N}{1.44} + \frac{N}{100} , $$

and Eq. (10.110) becomes:

$$ 1.44 \ln \frac{\alpha}{1 - \beta} - \frac{1.44}{100}\, N < -S_N < -\frac{1.44}{100}\, N + 1.44 \ln \frac{1 - \alpha}{\beta} . $$

Changing the signs and inserting the values α = 0.05 and β = 0.20, the
indecision interval, where the sampling continues, becomes:

$$ -2.25 + 0.0144\, N < S_N < 3.99 + 0.0144\, N . \tag{10.111} $$

H_0 is accepted for S_N values to the left of this interval, whereas, for values to
the right, the alternative H_1 is chosen.

The two discrete random variables S_N and N appearing in Eq. (10.111)
have a distribution that is difficult to determine with analytical methods.
However, we can simulate them, proceeding as follows: consider a number 0 ≤
ξ ≤ 1 supplied by the rndm routine and increment N by one; if ξ < 0.01
(hypothesis H_0), S_N is also increased by one; if S_N and N satisfy Eq. (10.111),
a further step is needed; if S_N > 3.99 + 0.0144 N then H_1 is chosen; if
S_N < −2.25 + 0.0144 N, H_0 is kept. The same test can be simulated under
the alternative hypothesis H_1, increasing S_N if ξ < 0.02. The results, obtained
by repeating the computer test 100,000 times with our code Sequen, are
displayed in Fig. 10.11. The simulation shows, in Fig. 10.11a and b, that the
average number of pieces sampled is 451 ± 1 under H_0 and 570 ± 1 under
H_1. This number is about half of that required for the fixed sample test with
the same values of α = 0.05 and β = 0.20, which is approximately 800, as
shown in Exercise 10.10. It can also be noticed that N has a distribution with
an exponential-like tail on the right, with a small (but not negligible) number
of tests reaching larger values than the sample size needed in the fixed N test.
The density of S_N under H_0, shown in Fig. 10.11c, is of exponential type, with
an average value of about five pieces. In the top part of Fig. 10.11a and b, an
estimate of the “experimental” values of the levels 1 − α′ and 1 − β′ is reported,
as the fraction of tests where H_0 was accepted when true (Fig. 10.11a) and the
same for H_1 (Fig. 10.11b). From the data the values of α′ ∈ (3.8 ± 0.1)%
and β′ ∈ (19.3 ± 0.1)% are obtained (check, as an exercise, the values of the
statistical errors). They do not coincide with the a priori fixed values because
the test deals with discrete variables. The sum α′ + β′ ∈ (23.1 ± 0.1)% is
less than the value α + β = 25%, according to Eq. (10.107). In Fig. 10.11d
the values in the plane (N, S_N) obtained in 100,000 tests under the two
hypotheses are displayed. The points are arranged just above or below the two
boundary lines of Eq. (10.111). This curve allows the graphical determination
of the indecision interval; for example, when N = 500, if S_N < 5 one chooses
H_0, if S_N > 11 one chooses H_1. Notice also that the discrete structure of the
problem is clearly visible in each histogram.

Fig. 10.11 Results of the simulation of 100,000 sequential tests of Exercise 10.14. Histogram of
the number N of pieces to be sampled if the defect probability is 1% (hypothesis H_0) (a); the same
if this probability is 2% (hypothesis H_1) (b); histogram of the number of discards S_N under H_0
(c); points in the plane (N, S_N) corresponding to the acceptance of the two hypotheses (d).
Panel annotations: (a) 1 − α′ = 96174/100000, m = 455, s = 360; (b) 1 − β′ = 80692/100000,
m = 568, s = 423; (c) m = 4.7
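
The simulation described in this exercise can be sketched in R as follows. This is not the Sequen code mentioned in the text, but a minimal illustration under stated assumptions: the function name seq_test, the cap max_n on the number of sampled pieces and the reduced number of repetitions n_rep (2000 instead of 100,000, to keep the run short) are all choices made here. The boundaries are those of Eq. (10.111) and runif plays the role of the random number routine.

```r
# One sequential test: pieces are sampled one at a time with defect
# probability p until S_N leaves the indecision interval of Eq. (10.111).
seq_test <- function(p, max_n = 10000) {
  n     <- seq_len(max_n)
  s     <- cumsum(runif(max_n) < p)       # S_N after each sampled piece
  upper <- 3.99 + 0.0144 * n              # above: choose H1
  lower <- -2.25 + 0.0144 * n             # below: keep H0
  stop_at <- which(s > upper | s < lower)[1]   # first N outside the interval
  c(N = stop_at, S = s[stop_at],
    acceptH0 = as.numeric(s[stop_at] < lower[stop_at]))
}

set.seed(1)
n_rep  <- 2000                                # assumed number of repetitions
sim_h0 <- replicate(n_rep, seq_test(0.01))    # 3 x n_rep matrix, H0 true
sim_h1 <- replicate(n_rep, seq_test(0.02))    # 3 x n_rep matrix, H1 true

mean_n_h0   <- mean(sim_h0["N", ])            # average sample size under H0
mean_n_h1   <- mean(sim_h1["N", ])            # average sample size under H1
alpha_prime <- 1 - mean(sim_h0["acceptH0", ]) # H1 chosen when H0 is true
beta_prime  <- mean(sim_h1["acceptH0", ])     # H0 kept when H1 is true

print(c(mean_N_H0 = mean_n_h0, mean_N_H1 = mean_n_h1,
        alpha_prime = alpha_prime, beta_prime = beta_prime))

# Histograms analogous to panels (a) and (b) of Fig. 10.11
# hist(sim_h0["N", ], breaks = 50); hist(sim_h1["N", ], breaks = 50)
```

With only 2000 repetitions the statistical errors are larger than those quoted in the text, so the estimates should agree with the values of Fig. 10.11 only within a few percent.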


10.14 Problems 


10.1 An urn contains black and white marbles in the proportion of 2 : 1 or 1 : 2 
(it is not known in favour of which colour). In the case of four extractions with 
replacement, find the ML estimate of the proportion as a function of the possible 
results of extracted black marbles (x = 0, 1, 2, 3, 4). 


10.2 Using the method of the previous problem, find the ML estimate of the 
probability p as a function of the successes x obtained in n = 3 attempts, 
considering the nine possible values: p = 0.1, 0.2, 0.3,..., 0.9. 


10.3 If μ̂ and σ̂ are the ML estimates of μ and σ, find the ML estimate of the quantile
x_α defined by F(x_α) = α.


10.4 Calculate the ML estimate of λ using n observed split times t_i from a
population having exponential density λ exp[−λt].


10.5 A method used to estimate the number N of elements in a finite population 
is to take a random sample of n elements, mark them and re-enter them into the 
population. A second sample of n elements is then drawn, and the number x of 
marked elements is determined. If in an experiment we fix n = 150 and find x = 37, 
calculate the ML estimate of N, (a) exactly and (b) under the approximation N ≫ n.
Hint: use the hypergeometric law (1.33).

What methods could be used to roughly determine the interval estimate of N, i.e.
N ∈ N̂ ± σ[N̂]?


10.6 If X is a Bernoulli variable with p.d.f. b(x) = p^x (1 − p)^{1−x}, verify whether
the statistics S = X_1 + X_2 and P = X_1 X_2 are sufficient.


10.7 The sum w = Σ_i x_i² is obtained from a sample of size n. Knowing that
the variable has the normal distribution N(0, σ²), with a known zero mean and
variance to be determined, apply the methods (a) and (b) of Sect. 10.5 to determine
the confidence interval of σ.


10.8 A sampling from a normal population N(μ, σ²) with μ = 0 has given the
values x_i = 1.499, 5.087, 0.983, 2.289, 1.045, −1.886. Find the ML point and
interval estimate of the variance σ², with CL = 95.4%.

Then, calculate another estimate using Eq. (6.76). Which method is preferable?


10.9 From 100 simulated values of X ∼ N(0, σ²), a sum w = Σ_i x_i² = 1044 is
obtained. Estimate the variance with the methods used in the two previous problems.


10.10 Two samples of ball bearing diameters (in cm), produced by the same
machine in two different weeks, have number of events, mean and standard
deviation, respectively, equal to n_1 = 100, m_1 = 2.08, s_1 = 0.16 and n_2 =
200, m_2 = 2.05, s_2 = 0.15. Calculate both the ML and the interval estimate of the
mean diameter value.


10.11 A radioactivity monitor detects the presence or not of nuclear particles 
without counting their number. From a measurement of 50 homogeneous random 
samples with the same amount of substance and within the same unit time interval, 
45 of them tested positive by the monitor. Find the ML estimate of the average 
number of particles emitted by that amount of substance per unit time. 


10.12 The recording of N = 1000 arrival times provided the histogram:


t       2    4    6    8   10   12   14   16   18   20
n(t)  472  276  132   51   36   12   11    7    1    2


which shows that between 0 and 2 s, there were 472 split times, between 2 and 4 s
276 split times and so on. Perform the best fit analysis using the exponential density
e(t) = λ exp(−λt) by determining the ML estimate of λ. Verify the validity of the
hypothesis with the χ² test.


10.13 Under H_0, a binary variable X assumes the values 0 and 1 with
probability 1 − ε and ε, respectively: P{X = 0, 1; H_0} = (1 − ε), ε. Let 0 < ε < 1
be a small positive number. Consider the alternative hypothesis P{X = 0, 1; H_1} =
ε, (1 − ε). Calculate the significance level and the power of the following test: accept
H_0 if in an experiment {X = 0}, whereas H_1 is accepted if {X = 1}.


10.14 In the situation of the previous problem, consider the result of two
independent trials and the following test: H_0 is accepted if {X_1 = 0, X_2 = 0}; H_1 is chosen
if {X_1 = 1, X_2 = 1}; a coin is tossed to randomly choose between H_0 and H_1 when
{X_1 + X_2 = 1}. Find the significance level and the power of this test, and compare
the result with that of the previous problem.


10.15 How is the randomized test performed in the case of Exercise 10.12? 


10.16 A car company uses suspensions withstanding an average of 100 h in an
extreme fatigue test (H_0 hypothesis). A supplier offers a new type of suspension
claiming an average of 110 h (alternative hypothesis H_1). Knowing that the lifetime
distribution is negative exponential, design a test on the new type of suspension. (a)
Keep the sample size fixed, use the normal approximation for the mean and make
sure that the probability of being wrong if the new suspensions are the same as the
old ones is 1% and the probability of accepting good suspensions is 95%. (b) Now
solve the problem with a sequential test and evaluate, with a simulation code, the
distributions of the average times and of the number of pieces required for the test.
Should the fixed-size sample or sequential test be used? What strategy would you
adopt?


10.17 Find the power function for the Problem 10.16. Hint: compare Eq. (3.57) 
with Eq. (3.67). 


10.18 We want to test the null hypothesis that the probability of heads in one toss
of a given coin is p = 0.5 against the alternative hypothesis p = 0.3. What is the
number n of tosses needed to decide between these two hypotheses, assuming a test
level of 10% and a power of 90%, i.e. α = 0.10 and β = 0.10? Use the Gaussian
approximation.


10.19 With the same values of α and β as in the previous problem, find the number
of successes x as a function of the number n of tosses that allows one to choose between
the two hypotheses with a sequential test. Determine, by flipping a coin or with a
simulation code, the average number of flips n needed to decide between the two
hypotheses. Compare the result with that of the previous problem.


10.20 If X has density p(x) = (1 + θ)x^θ, 0 ≤ x ≤ 1, find, with a sample of
n = 100 events, the best critical region for testing θ = θ_0 = 1 against θ = θ_1 = 2
at a level α = 0.05. Determine the test power and the power function.


Chapter 11
Least Squares


Of all the principles which can be proposed for that purpose, I 
think there is none more general, more exact, and more easy of 
application, than that of which we have made use in the 
preceding researches, and which consists of rendering the sum 
of the squares of the errors a minimum. 


Adrian Legendre, “NEW METHODS FOR DETERMINATION OF 
THE ORBITS OF COMETS”. 


11.1 Introduction 


In Sect. 10.6 we showed that the least squares (LS) method originates from the 
maximum likelihood principle when the variables are Gaussian. 

Historically, however, things have developed differently. Indeed, while maximum 
likelihood was introduced by Fisher in the early 1900s, the least squares method 
was first applied by the French mathematician Legendre in 1803, as indicated 
in the epigraph of this chapter. Later Laplace, in his famous treatise “Théorie 
Analytique des Probabilités” of 1812, showed that the LS method produces unbiased 
estimates even in the case of non-Gaussian variables. The decisive step for the 
correct placement of the LS method in statistics was then taken by Gauss in 1821, 
with the proof that the LS estimators are unbiased and efficient (minimal variance), 
when the observations are linear functions of the parameters to be determined. This 
fundamental theorem was extended and better formalized by Markov in 1912 and is 
now known as the Gauss-Markov theorem. 

On the basis of these results, we can state that, when the densities of the sample 
populations are not a priori known, the least squares method can be used as an 
alternative to the maximum likelihood method when the expected values of the 
observations can be expressed as linear combinations of the parameters. The LS 
principle is therefore not only a consequence but also a complement to the ML 
method, in respect of the class of unbiased estimators with minimum variance. 
If the expected values are a nonlinear combination of the parameters, the LS 
method is still applicable, but the resulting estimators could be biased and not of 
minimal variance. Later, we will also briefly comment on this case. The LS method, as 


© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 475 
A. Rotondi et al., Probability, Statistics and Simulation, La Matematica per i] 3+2 
139, https://doi.org/10.1007/978-3-031-09429-3_11 




discussed in Sect. 10.6, consists in finding the minimum of a $\chi^2$-type formula, such 
as Eq. (10.52). Generalizing, we can define the method as: 

Definition 11.1 (Least Squares (LS) Method) Let $y_1, y_2, \ldots, y_n$ be the observed 
values of the random variable $Y$ (called “dependent or response variable”) such that 
$\langle Y_i \rangle = \mu_i(x_i, \theta) \equiv \mu_i(\theta)$. In the function $\mu_i(x_i, \theta)$, $\theta \in \Theta$ is a vector 
of parameters with dimension $(p+1)$, and $x_1, \ldots, x_n$ are $n$ observed values of a 
variable $X$ called “predictor or independent variable”. The variances $\mathrm{Var}[Y_i] = \sigma_i^2$ 
are known. The least squares estimate $\hat\theta$ of $\theta$ is the one that minimizes the quantity: 

$$\chi^2(\theta) = \sum_{i=1}^{n} \frac{[y_i - \mu_i(\theta)]^2}{\sigma_i^2} . \qquad (11.1)$$

The predictor can also be a vector $\boldsymbol{X}$. In this case $\boldsymbol{x}_i$ in $\mu_i(\boldsymbol{x}_i, \theta)$ denotes the $i$th observed 
value of $\boldsymbol{X}$. 


From the definition it is understood that the functions $\mu_i(\theta)$, as $\theta$ varies, constitute 
a class of models that relate the (random or deterministic) variable $X$ with $\langle Y \rangle$. 
Since this relation allows us to evaluate $\langle Y_i \rangle$ without having observed $y_i$, $X$ is called 
the predictor, while $Y$ is called the response. The minimization of Eq. (11.1) with 
respect to $\theta$ then identifies the best model within this class. Unlike the ML method, 
we note that here it is not necessary to know the distribution of the variables to 
estimate $\theta$, but only their expected value and variance. Another important aspect of 
the method is that the variances $\sigma_i^2$ do not depend on $\theta$. 

After the minimization with respect to $\theta$, it is possible to test the model, if we are 
not sure of its validity. If the variances $\sigma_i^2$ are known, the $\chi^2$ test can be performed 
as explained in Sects. 7.5 and 7.6. 
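
As a minimal sketch of Definition 11.1 (it is not one of the book's routines), the quantity (11.1) can be minimized numerically with the base R function optim, and the minimum can then be used in the $\chi^2$ test. The linear model mu, the data and the known standard deviations below are invented for illustration only.

```r
# Sketch: least squares estimate by direct minimization of Eq. (11.1).
# Model, data and sigmas are illustrative assumptions, not from the text.
mu <- function(x, theta) theta[1] + theta[2] * x   # hypothetical model mu_i(theta)

chi2 <- function(theta, x, y, sigma) {
  sum((y - mu(x, theta))^2 / sigma^2)              # Eq. (11.1)
}

x     <- c(1, 2, 3, 4, 5)
y     <- c(2.1, 3.9, 6.2, 7.8, 10.1)
sigma <- c(0.2, 0.2, 0.3, 0.3, 0.4)                # known standard deviations

fit <- optim(par = c(0, 1),
             fn = function(theta) chi2(theta, x, y, sigma),
             method = "BFGS")

theta_hat <- fit$par                               # least squares estimate
chi2_min  <- fit$value                             # value of Eq. (11.2)
dof       <- length(y) - length(theta_hat)         # n - (p + 1)
p_value   <- pchisq(chi2_min, df = dof, lower.tail = FALSE)
```

For a model that is linear in the parameters, the same estimate should be returned by lm(y ~ x, weights = 1/sigma^2), which is a convenient cross-check of the numerical minimization.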

To avoid misunderstandings, we remind you that: 


• The minimization procedure is purely mathematical and not statistical and 
consists in finding the minimum of the “$\chi^2$-type” quantity (11.1). 
• The quantity: 

$$(\chi^2)_{\min} = \chi^2(\hat\theta) = \sum_{i=1}^{n} \frac{[y_i - \mu_i(\hat\theta)]^2}{\sigma_i^2} , \qquad (11.2)$$

does follow the $\chi^2(n-p-1)$ distribution if the $Y_i$ are independent Gaussian random 
variables with expected value $\mu_i(\theta)$, which is a linear function of $\theta$. 

After this clarification, we think that always denoting by $\chi^2$ the quantity to be 
minimized, as we will do in the following, even if it may not follow the 
$\chi^2(n-p-1)$ distribution, does not lead to confusion.¹ 


¹ The degrees of freedom are now $(n-p-1)$ and not $(n-p)$ because in Definition 11.1 the size 
of $\theta$ is $(p+1)$. 




If the goodness of fit test does not reject the null hypothesis or if one is a priori 
sure of the adopted model, it may be useful to give confidence intervals for $\theta$ or for 
$\mu(\theta)$. We will see how to carry out this operation in the linear Gaussian case. The 
confidence intervals for $\mu(\theta)$ are then used to predict $\langle Y_i \rangle$ at values of $x_i$ for which 
no experiments have been performed. 

In Sect. 10.7 we have already discussed an example, derived from the ML 
principle, where the LS method is used to fit density functions to histograms. In 
this chapter we will instead describe the most important and common applications 
of the LS method, i.e. the search for functional forms of $\mu_i(\theta)$. Our goal is twofold: 
to explain a method (although not necessarily always the best) to perform and 
verify the fit that can be used in every situation, and enable you to choose the right 
algorithm and to use the minimization codes in a statistically correct way for any 
type of problem. 

We will not describe in detail how these algorithms work, because this is 
essentially a topic of numerical computation, rather than of statistics. For more 
details you can refer to the texts [BR92] and [PFT W972] or directly to the specific 
user manuals of the minimization codes, such as [Jam92]. In our web pages [RPP], 
you will also find the Linfit and Nlinfit routines to perform linear and 
nonlinear least squares fits and further technical-numerical information. 

Let us now begin by discussing two types of dependency between variables that 
may occur and which will be the focus of most of this chapter. 


11.2 No Errors on Predictors 


A very frequent statistical situation is given by a random variable $Y$ written as: 

$$Y = f(x) + Z , \qquad f(x) \equiv f(x, \theta) , \qquad (11.3)$$

where $f(x)$ is a known function (apart from the parameter $\theta$) of the non-random 
predictor $x$. 

Note that one can always set $\langle Y \rangle = f(x)$ and $\langle Z \rangle = 0$, because if $\langle Z \rangle = z_0$, it 
would be enough to redefine $f$ as $f(x) + z_0$ and Eq. (11.3) would hold again. This 
trick allows us to represent $Y$ as the sum of a deterministic component (its expected 
value), which carries the essential part of the relationship with the predictor $x$, and 
of a random part, which represents the fluctuations around the mean. 

In the case of a set of fixed and measurable (without error) $x_i$ values, a repeated 
sampling of $Y$ for each value $x_i$ provides a graph similar to that of Fig. 11.1a, with 
fixed $x_i$ and $Y_i$ fluctuating around their average values. If only one measurement is 
performed and the standard deviation of the $Z_i$ is known, the plot is drawn as in 
Fig. 11.1b, with the error bars equal to the standard deviations $\sigma_i$ of $Z_i$, and centred 
on the measured values $y_i$. 

Fig. 11.1 Sampling of a random variable y as a function of a non-random variable x; repeated 
samplings (a) and standard representation in the case of a single sampling (b) 

The goal of the least squares method is to determine the functional form $f(x)$ 
which links $\langle Y \rangle$ to the deterministic variable $x$. In other words, we need to 
determine, for each $x$, the mean $f(x, \theta)$ as the curve with respect to which the 
fluctuations of $Y$ are random. Here “random” means that the dependence between 
$\langle Y \rangle$ and $x$ is completely described by $f$, the remainder being a pure error term. If in 
Eq. (11.1) $\mu_i(\theta) = f(x_i, \theta)$, the $\chi^2$ is given by: 


$$\chi^2(\theta) = \sum_i \frac{[y_i - f(x_i, \theta)]^2}{\sigma_i^2} , \qquad (11.4)$$

where the variance of $Y$ appears in the denominator. Based on Eq. (11.3), it is given 
by: 

$$\sigma_i^2 = \mathrm{Var}[Y_i] = \mathrm{Var}[Z_i] .$$


For example, for a linear relation, one has: $f(x, \theta) = \theta_1 + \theta_2 x$. 

The search for a density starting from a histogram, obtained minimizing 
Eq. (10.60), is an instance of the non-random predictor case. In fact, the function 
$\mu_i(x_i, \theta) = N p(x_i, \theta)$ coincides with the expected value of $Y_i \equiv I_i$, the variables $x_i$ 
are the spectrum values in the discrete case or the midpoint of the spectrum within 
the predetermined bin $\Delta x$ in the continuous case, and there is no error on $x$. The 
statistical error is only on $Y$ and is given by the fluctuations in the number of events 
falling in the different histogram bins during sampling. 

Another very frequent situation comes from the measurements of physical 
quantities, when one (the $x$ variable) is measured with negligible error and basically 
under the experimenter's full control, while the other (the $Y$ variable) is evaluated 
with lower precision. In this case the function $y = f(x, \theta)$ represents the physical 
law between the two variables. 

Let us now see a second type of dependence between variables defined by the 
relations: 

$$Y = f(X) + Z , \qquad f(X) \equiv f(X, \theta) , \qquad (11.5)$$

where $X$ and $Z$ are random variables and $\langle Z \rangle = 0$. Unlike Eq. (11.3), $X$ is now a 
random variable, and between $X$ and $Y$ there is a statistical correlation, in general 
different from zero. In a repeated sampling with $n$ independent trials, we observe 
as many values of the random vector $(X_i, Y_i)$, which we indicate with the pairs 
$(x_i, y_i)$, $i = 1, \ldots, n$. If we plot them on a graph, we obtain a point cloud like in 
Fig. 11.2a. The joint density of $(X, Y)$, whose functional form is given by Eq. (4.3), 
can be represented as in Fig. 11.2b, indicating the values of its integral with different 
shades of grey in the different regions of the plane. 

Fig. 11.2 Sampling of pairs of correlated random variables (a) and representation of the 
corresponding density (b) 

As in Eq. (11.3), we assume that the argument of $f$ is observed exactly. If we are 
interested not so much in the random mechanism that generates $X$ as in the relation 
between $X$ and $Y$, we can go back to Eq. (11.4), set a value $x$ and use the distribution 
of $Y$ conditional on $\{X = x\}$ to calculate the $\chi^2$ function as: 

$$\chi^2(\theta) = \sum_i \frac{[y_i - f(x_i, \theta)]^2}{\mathrm{Var}[Y_i|x_i]} . \qquad (11.6)$$

Here the uncertainties $\sigma_i^2$ have been replaced with the conditional variances 
$\mathrm{Var}[Y_i|x_i]$, and the expected values of $Y_i$ have been replaced with the corresponding 
conditional expected values $f(x_i, \theta) = \langle Y_i|x_i \rangle$. Therefore, keeping $x_1, \ldots, x_n$ 
fixed, we obtain $\hat\theta$ by minimizing Eq. (11.6). According to the model, $\langle Y|x \rangle$ is given 
by $f(x)$ independently of any $(X, Z)$ distribution. Instead, the latter distribution 
influences $\mathrm{Var}[Y|x]$, since: 

$$\mathrm{Var}[Y|x] = \mathrm{Var}[f(X)|x] + \mathrm{Var}[Z|x] = \mathrm{Var}[Z|x] .$$

If $X$ and $Z$ are independent, then: 

$$\mathrm{Var}[Y|x] = \mathrm{Var}[Z|x] = \mathrm{Var}[Z] , \quad \text{and} \quad \mathrm{Var}[Y] = \mathrm{Var}[f(X)] + \mathrm{Var}[Z] . \qquad (11.7)$$




The denominator of Eq. (11.6) becomes constant, whereas if there is a dependence 
between X and Z the condition Var[Z|x] = Var[Z] is no longer valid. 

The functional dependence described by Eq. (11.5) is very common, and the case 
of Table 6.4 is one example. In fact, the chest perimeter of a certain soldier will be 
related not to the average height value, but just to the height of that same soldier. 
We can also think of a financial study that tries to correlate the trend of the Milan 
stock exchange index with that of New York: the Milan index will be correlated to 
the current value of the New York index, with the superposition of other random 
variables containing (local) sources of variation. 

Equations (11.3) and (11.5) are formally the same, but the meaning of the 
function $f$ is different: in the first case, $x$ is a non-random variable, and $f(x)$ 
represents an analytic functional dependence between $x$ and $\langle Y \rangle$, while in the 
second case, $f(X)$ induces the statistical correlation between them. Then, referring 
to Eq. (11.5), we can call $f(X)$ the correlation function between $X$ and $Y$. 

To better clarify this concept, when $X$ and $Z$ are independent, $\mathrm{Var}[Z] = 
\mathrm{Var}[Z|x]$, and from Eq. (11.7), we have: 

$$\sigma_y^2 \equiv \mathrm{Var}[Y] = \mathrm{Var}[f(X)] + \mathrm{Var}[Z|x] . \qquad (11.8)$$

Since $\mathrm{Var}[Y|x] = \mathrm{Var}[Z|x]$, Eq. (11.8) can be rewritten as: 

$$\mathrm{Var}[Y|x] = \sigma_y^2 \left( 1 - \frac{\sigma^2[f(X)]}{\sigma_y^2} \right) . \qquad (11.9)$$

This equation should be compared to Eq. (4.55), which represents the conditional 
variance of $Y$ given $\{X = x\}$ when the p.d.f. is the bivariate Gaussian. For 
convenience we rewrite this equation here: 

$$\mathrm{Var}[Y|x] = \sigma_y^2 (1 - \rho^2) .$$

By analogy between this expression and Eq. (11.9), we are then led to define a more 
general form of the correlation coefficient between $X$ and $Y$: 

$$\rho = \pm \frac{\sigma[f(X)]}{\sigma_y} , \qquad (11.10)$$

where the sign is that of the covariance between $X$ and $Y$, $\mathrm{Cov}[X, Y]$. Notice that, 
from Eqs. (11.7) and (11.10), it follows that $\rho = \pm 1$ when $\mathrm{Var}[Z] = 0$ and that 
$\rho = 0$ when $f(X)$ is constant. It is also easy to show that, if $f(X) = a + bX$, the 
correlation coefficient takes the familiar form (see Problem 11.1): 

$$\rho = \pm \frac{\sigma[f(X)]}{\sigma_y} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} . \qquad (11.11)$$
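
The equivalence of Eqs. (11.10) and (11.11) for a linear correlation function can be checked with a small simulation. The following is only a sketch under assumed values: the coefficients a and b, the Gaussian choice for X and Z and their standard deviations are arbitrary and do not come from the text.

```r
# Sketch: generalized correlation coefficient, Eq. (11.10), versus the
# familiar form, Eq. (11.11), for f(X) = a + b X with X and Z independent.
set.seed(1)
n <- 1e5
a <- 1; b <- 2                       # assumed line parameters
X <- rnorm(n, mean = 0, sd = 1.5)    # assumed distribution of the predictor
Z <- rnorm(n, mean = 0, sd = 2)      # random part, independent of X
Y <- a + b * X + Z

fX <- a + b * X
rho_general <- sign(cov(X, Y)) * sd(fX) / sd(Y)    # Eq. (11.10)
rho_usual   <- cov(X, Y) / (sd(X) * sd(Y))         # Eq. (11.11)
rho_theory  <- b * 1.5 / sqrt(b^2 * 1.5^2 + 2^2)   # from Eq. (11.7)

c(rho_general, rho_usual, rho_theory)
```

The three numbers should agree within the statistical fluctuations of the simulated sample.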




We recall that, in Chap. 4, we have demonstrated that the correlation function 
between two jointly Gaussian random variables can only be linear and that $\langle Y|x \rangle$ 
follows the regression line (see Eq. 4.54): 

$$f(x) = \langle Y|x \rangle = \mu_y + \rho\, \frac{\sigma_y}{\sigma_x} (x - \mu_x) .$$

This situation appears here as a special case of Eq. (11.5), valid for any function 
$f(X)$, where the correlation coefficient is defined more generally according to 
Eq. (11.10). Also in this case, the Cauchy-Schwarz inequality (4.28) assures that 
the property $-1 \le \rho \le 1$ remains valid. 


11.3 Errors in Predictors 


A particular but important case regarding the study of a functional dependence 
between $\langle Y \rangle$ and $\langle X \rangle$ occurs when: 

$$X = x_0 + X_R , \qquad Y = f(x_0) + Z , \qquad f(x_0) \equiv f(x_0, \theta) , \qquad (11.12)$$

where $X_R$ and $Z$ are independent with a null expected value. The relation that 
determines $Y$ is identical to Eq. (11.3), but here $x_0$ is subject to statistical 
fluctuations, that is, the predictor $x_0$ is unknown, and only the random variable $X$ is 
observed. 

A repeated sampling of $(X, Y)$, in correspondence with $n$ true values $x_0 = 
(x_{01}, \ldots, x_{0n})$, gives rise to a plot like that of Fig. 11.3a, with the variables $x_i$ and 
$y_i$ fluctuating around their mean values $x_{0i}$ and $f(x_{0i})$. If a single measurement 
is performed and the standard deviations of $X_{Ri}$ and $Z_i$ are known, the graph 
is drawn as in Fig. 11.3b, with the standard deviations of $X_{Ri}$ and $Z_i$ drawn as 
error bars centred on the measured values $x_i$ and $y_i$. This is often the case of 
laboratory measurements, when two quantities linked by a deterministic physical 
law are measured with non-negligible uncertainties. Here too, the task of the least squares 
method is to find the physical law as a best-fit curve, reducing the effects of the 
uncertainties introduced by the measurement. 

Fig. 11.3 Sampling of a random variable Y as a function of the mean value of another random 
variable X; repeated samplings (a) and standard representation in the case of an experiment with a 
single sampling only (b) 

From Definition 11.1, the $\chi^2$ to be minimized is in this case: 

$$\chi^2(\theta, x_0) = \sum_i \frac{(x_i - x_{0i})^2}{\sigma_{x_i}^2} + \sum_i \frac{[y_i - f(x_{0i}, \theta)]^2}{\sigma_{y_i}^2} . \qquad (11.13)$$

Now $(x_i, y_i)$ are the observed pairs of points, and the crucial aspect of this equation 
is that, according to Eq. (11.12), $f$ is a function of the unknown mean variables $x_0$, 
not of the measured ones. The free parameters to be estimated are then $(n + p + 1)$, 
i.e. the sum of the dimensions of $x_0$ and of $\theta$. 

The minimization of Eq. (11.13) is not difficult if one has a nonlinear $\chi^2$ 
minimizing program which accepts the equation of the function to be minimized. 
This is the case of our Linemq routine that we use to solve Problems 11.2 
and 11.3. However, the doubled number of experimental values and the high number 
$(n + p + 1)$ of free parameters can cause problems when dealing with large samples. 
To overcome these problems, we approximate the minimum point of Eq. (11.13) by 
setting $x_{0i} = x_i$ in the first sum, which thus is always zero, and by evaluating the 
errors $\sigma_{y_i}$ with a first-order Taylor expansion of $f(X)$ around $x_0$: 

$$f(X) = f(x_0) + (X - x_0)\, f'(x_0) + O\big((X - x_0)^2\big) . \qquad (11.14)$$

If $\mathrm{Var}[X_R]$ is small enough to omit the terms of order higher than the first (even if 
random), we can apply the variance operator to Eq. (11.14) to obtain: 

$$\mathrm{Var}[f(X)] = f'^2(x_0)\, \sigma_x^2 .$$

Since $f'(x_0)$ depends on the unknown $x_0$ value, we replace it with the observed 
value $x$, to obtain the so-called effective variance $\sigma_E^2$: 

$$\mathrm{Var}[Y - f(X)] = \mathrm{Var}[Y] + \mathrm{Var}[f(X)] \simeq \sigma_y^2 + f'^2(x)\, \sigma_x^2 \equiv \sigma_E^2 , \qquad (11.15)$$

where the assumption of the independence between $X_R$ and $Z$ has been used. 
Using the differences $(y_i - f(x_i, \theta))$ in the numerator and the effective 
variance in the denominator, one obtains the simple result: 


$$\chi^2(\theta) = \sum_i \frac{[y_i - f(x_i, \theta)]^2}{\sigma_{E_i}^2} = \sum_i \frac{[y_i - f(x_i, \theta)]^2}{\sigma_{y_i}^2 + f'^2(x_i, \theta)\, \sigma_{x_i}^2} , \qquad (11.16)$$

where only the measured values $(x_i, y_i)$ appear together with their variances taken 
into account through the effective variance $\sigma_E^2$. In Eq. (11.16) the free parameters 
are again only the $(p + 1)$ parameters $\theta$. 

If the values of $f'^2(x_i, \theta)$ are unknown, Eq. (11.16) does not follow 
Definition 11.1, and the $\chi^2$ minimization always gives rise to nonlinear equations 
due to the derivative of $f$ in the denominator, even when $f$ is a linear function 
of the $\theta$ parameters. However, if one has only linear minimization programs or 
intends to reduce the computation time, an iterative method can be used, starting 
at the beginning with $\sigma_{E_i}^2 = \sigma_{y_i}^2$ and setting $\sigma_{E_i}^2 = \sigma_{y_i}^2 + f'^2(x_i, \hat\theta^{(k-1)})\, \sigma_{x_i}^2$ 
in the $k$-th cycle, using the estimates $\hat\theta^{(k-1)}$ of the previous iteration. In this 
way, the denominator becomes independent of the parameters being minimized 
in the current cycle, and the linearity is restored [Ore82]. However, it can be 
shown that this procedure gives inconsistent estimators (even if closer to the true 
parameter values than the methods that ignore the errors on $X$ [Lyb84]), so that 
a nonlinear minimization code to solve Eq. (11.16) should be used. The results 
obtained with this algorithm are practically coincident with those given by the 
general formula (11.13). This fact can be verified with our FitLineBoth routine, 
which performs the minimization of Eq. (11.16). In the case of a straight line, if the 
data $x$, $y$, $s_x$, $s_y$ have been stored in the vectors x, y, sx, sy, respectively, the 
calling instructions are the following (for more details you can see the comments 
inside the routine): 


> fun <- function(par, x) { return(par[1] + par[2] * x) }
> dfun <- function(par, x) { return(par[2]) }   # derivative of fun
> FitLineBoth(x, y, sx, sy, par = c(0.5, 0.5), fun = fun, dfun = dfun)
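
The FitLineBoth routine is not reproduced here. For readers who want to see the idea of Eq. (11.16) in isolation, the following is a minimal sketch that uses the base R function optim instead; the helper name chi2_eff and the data are hypothetical, chosen only for illustration.

```r
# Sketch: effective variance fit of a straight line, Eq. (11.16).
# chi2_eff and the data below are illustrative assumptions.
fun  <- function(par, x) par[1] + par[2] * x
dfun <- function(par, x) rep(par[2], length(x))    # derivative of fun

chi2_eff <- function(par, x, y, sx, sy, fun, dfun) {
  var_eff <- sy^2 + dfun(par, x)^2 * sx^2          # effective variance
  sum((y - fun(par, x))^2 / var_eff)
}

x  <- c(1.0, 2.1, 2.9, 4.2, 5.0, 6.1)
y  <- c(1.4, 2.6, 3.4, 4.9, 5.4, 6.7)
sx <- rep(0.1, length(x))                          # errors on x
sy <- rep(0.2, length(y))                          # errors on y

fit <- optim(par = c(0.5, 0.5),
             fn = function(par) chi2_eff(par, x, y, sx, sy, fun, dfun),
             method = "BFGS")

fit$par     # estimates of intercept and slope
fit$value   # chi-square at the minimum
```

Because the slope enters the denominator, the problem is nonlinear even for a straight line, which is why a general-purpose minimizer is used in this sketch.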


The effective variance has an intuitive geometric interpretation, shown in 
Fig. 11.4: the term $f'^2 \sigma_x^2$, which is added to the initial variance $\sigma_y^2$, is just the 
effect of the fluctuations in $x$ projected on the $Y$ axis using the tangent of the angle 
$\alpha$ (i.e. the derivative $f'(x)$). 


Fig. 11.4 Geometric interpretation of the effective variance 




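If the variables $X_i$ and $Y_i$ have a non-zero covariance $\mathrm{Cov}[X_i, Y_i]$, using 
Eq. (5.65), we can also further generalize Eq. (11.16) as follows: 

$$\chi^2(\theta) = \sum_i \frac{[y_i - f(x_i, \theta)]^2}{\sigma_{y_i}^2 + f'^2(x_i, \theta)\, \sigma_{x_i}^2 - 2 f'(x_i, \theta)\, \sigma_{x_i y_i}} . \qquad (11.17)$$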


11.4 Least Squares Regression Lines: Unweighted Case 


In this section we examine a particularly simple case of the relation (11.3), which 
includes also Eq. (11.5), considering X and Z independent and conditioned to the 
observed values of X. Therefore we will always write x in lowercase. 

When the function connecting $\langle Y|x \rangle$ to $x$ is a straight line, that is, $f(x, \theta) = 
a + bx$, with $\theta = (a, b)$, the relation between $x$ and $Y$ becomes: 

$$Y = a + bx + Z , \qquad (11.18)$$

where we assume also that $\mathrm{Var}[Y] = \mathrm{Var}[Z] = \mathrm{Var}[Z|x] = \sigma^2$ independently of 
$x$. For example, this happens when the distribution of $(X, Y)$ is a bivariate Gaussian 
(see Eq. (4.55)). 

Moreover, we assume $\sigma^2$ to be unknown, since the opposite case can be 
considered as a particular situation of weighted linear least squares, which we will 
describe in Sect. 11.6. 

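After the observation of an experimental random sample $(x_1, y_1), \ldots, (x_n, y_n)$, 
the $\chi^2$ to be minimized can be written as: 

$$\chi^2(a, b) = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - a - b x_i)^2 . \qquad (11.19)$$

The minimization of this function does not depend on $\sigma^2$ and requires setting to 
zero the partial derivatives: 

$$\frac{\partial \chi^2}{\partial a} = 0 \;\Rightarrow\; \sum_i y_i - n a - b \sum_i x_i = 0 , \qquad \frac{\partial \chi^2}{\partial b} = 0 \;\Rightarrow\; \sum_i x_i y_i - a \sum_i x_i - b \sum_i x_i^2 = 0 . \qquad (11.20)$$

To simplify the notation we define: 

$$S_x = \sum_{i=1}^{n} x_i , \quad S_y = \sum_{i=1}^{n} y_i , \quad S_{xx} = \sum_{i=1}^{n} x_i^2 , \quad S_{xy} = \sum_{i=1}^{n} x_i y_i , \qquad (11.21)$$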




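so that Eq. (11.20) generates the system of equations: 

$$a\, n + b\, S_x = S_y \qquad (11.22)$$
$$a\, S_x + b\, S_{xx} = S_{xy} , \qquad (11.23)$$

or, in matrix form: 

$$\begin{pmatrix} n & S_x \\ S_x & S_{xx} \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} S_y \\ S_{xy} \end{pmatrix} . \qquad (11.24)$$

The parameters a and b represent the unknowns to be determined, while the 
sums (11.21) are known numerical coefficients. The least squares estimates of the 
parameters are immediately obtained from the matrix analysis methods: 

$$\hat a = \frac{1}{D} \left[ S_{xx} S_y - S_x S_{xy} \right] , \qquad (11.25)$$
$$\hat b = \frac{1}{D} \left[ n S_{xy} - S_x S_y \right] , \qquad (11.26)$$

where: 

$$D = n S_{xx} - S_x^2 , \qquad (11.27)$$

is the determinant of the system (11.24). 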
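
Equations (11.21) and (11.24)-(11.27) translate almost literally into R. The sketch below uses made-up data and compares the closed-form estimates with the solution of the matrix system and with the standard lm function.

```r
# Sketch: closed-form least squares line, Eqs. (11.21)-(11.27).
# The data are invented for illustration.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.2, 1.9, 3.2, 3.8, 5.1, 5.9)
n <- length(x)

Sx  <- sum(x);    Sy  <- sum(y)                   # Eq. (11.21)
Sxx <- sum(x^2);  Sxy <- sum(x * y)

D     <- n * Sxx - Sx^2                           # Eq. (11.27)
a_hat <- (Sxx * Sy - Sx * Sxy) / D                # Eq. (11.25)
b_hat <- (n * Sxy - Sx * Sy) / D                  # Eq. (11.26)

# The same result from the matrix form, Eq. (11.24)
M   <- matrix(c(n, Sx, Sx, Sxx), nrow = 2)
sol <- solve(M, c(Sy, Sxy))

# Cross-check with the built-in linear model fit
ref <- unname(coef(lm(y ~ x)))
rbind(closed_form = c(a_hat, b_hat), matrix_form = sol, lm = ref)
```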

The parameters thus evaluated are marked with the symbol “^” to indicate that 
they are estimates, not the true values of a and b. 

Using Eq. (4.23), after some algebra, Eq. (11.26) can be rewritten as: 

$$\hat b = r\, \frac{s_y}{s_x} , \qquad (11.28)$$

where $r$ is the estimate given by Eq. (6.117) of the linear correlation coefficient $\rho$ 
of Eq. (11.11). Moreover, from Eq. (11.22), we obtain $\hat a = \langle y \rangle - \hat b \langle x \rangle$. Given these 
coefficients, the estimated least squares line $\hat y(x) = \hat a + \hat b x$ can be considered as an 
estimate of the regression line (4.54), also discussed at the end of Sect. 11.2: 

$$\hat y(x) = \hat a + \hat b x = \langle y \rangle + r\, \frac{s_y}{s_x} (x - \langle x \rangle) . \qquad (11.29)$$
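
The link between the slope of the least squares line and the sample correlation coefficient can be verified numerically. In this sketch, with invented data, the built-in functions sd and cor use the denominator n - 1, but the ratio of the two standard deviations does not depend on that choice.

```r
# Sketch: slope and intercept from the sample correlation coefficient,
# as in Eq. (11.29). The data are invented for illustration.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.2, 1.9, 3.2, 3.8, 5.1, 5.9)

r     <- cor(x, y)                     # sample linear correlation coefficient
b_hat <- r * sd(y) / sd(x)             # slope of the least squares line
a_hat <- mean(y) - b_hat * mean(x)     # intercept, from Eq. (11.22)

y_hat <- function(x0) mean(y) + r * sd(y) / sd(x) * (x0 - mean(x))  # Eq. (11.29)
y_hat(3.5)                             # equals mean(y), since mean(x) = 3.5
```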


So we conclude that the regression line represents the locus of the points of the 
conditional distribution averages for Gaussian variables or the least squares line 
for non-Gaussian variables. The error on the parameter estimates, which will be 
indicated as usual with the symbol $s$, is obtained by applying Eq. (5.65) on the 
variance of functions of random variables. This allows us to evaluate the propagation 
effects of the fluctuations of the variables $Y$ (i.e. of the random part) on the 
parameters $\hat a$ and $\hat b$ using Eqs. (11.25 and 11.26): 

$$\mathrm{Var}[\hat\theta] = \sigma^2 \sum_i \left( \frac{\partial \hat\theta}{\partial y_i} \right)^2 , \qquad \hat\theta = \hat a,\ \hat b . \qquad (11.30)$$

The covariance $\mathrm{Cov}[Y_i, Y_j]$ now does not appear because the independence of the 
observations $Y_i$ is assumed. Taking into account that: 

$$\frac{\partial S_y}{\partial y_i} = 1 \quad \text{and} \quad \frac{\partial S_{xy}}{\partial y_i} = x_i ,$$

one obtains: 

$$\frac{\partial \hat a}{\partial y_i} = \frac{1}{D} \left[ S_{xx} - S_x x_i \right] , \qquad \frac{\partial \hat b}{\partial y_i} = \frac{1}{D} \left[ n x_i - S_x \right] , \qquad (11.31)$$

$$\mathrm{Var}[\hat a] = \frac{\sigma^2}{D^2} \sum_i \left[ S_{xx} - S_x x_i \right]^2 = \sigma^2\, \frac{S_{xx}}{D} , \qquad (11.32)$$
$$\mathrm{Var}[\hat b] = \frac{\sigma^2}{D^2} \sum_i \left[ n x_i - S_x \right]^2 = \sigma^2\, \frac{n}{D} . \qquad (11.33)$$

These are exact formulae of the estimator variances, because $\hat a$ and $\hat b$ depend linearly 
on $y_i$. We may wonder whether these errors behave like $1/\sqrt{n}$, as the error on the 
mean estimate of a random sample. Having denoted by $s_x^2$ the sample variance of 
$(x_1, \ldots, x_n)$ with denominator $n$ instead of $n - 1$: 

$$s_x^2 = \frac{S_{xx}}{n} - \frac{S_x^2}{n^2} = \frac{D}{n^2} , \qquad (11.34)$$

with a little algebra, one finds indeed that: 

$$\mathrm{Var}[\hat a] = \frac{\sigma^2}{n} \left( 1 + \frac{\langle x \rangle^2}{s_x^2} \right) , \qquad \mathrm{Var}[\hat b] = \frac{\sigma^2}{n\, s_x^2} . \qquad (11.35)$$

If $(x_1, \ldots, x_n)$ behaves as a random sample, $s_x^2$ and $\langle x \rangle^2$ almost surely converge to 
constant values, and therefore the errors converge to zero as $1/\sqrt{n}$. We have already 
noticed several times how this behavior is common to many statistical estimators. 
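
The algebra leading from Eqs. (11.32) and (11.33) to Eq. (11.35) can be spelled out as follows. The only ingredients are the definitions (11.21) and (11.34), from which $S_x = n \langle x \rangle$, $S_{xx} = n (s_x^2 + \langle x \rangle^2)$ and $D = n^2 s_x^2$.

```latex
\begin{aligned}
\mathrm{Var}[\hat a] &= \sigma^2\,\frac{S_{xx}}{D}
  = \sigma^2\,\frac{n\,(s_x^2 + \langle x\rangle^2)}{n^2 s_x^2}
  = \frac{\sigma^2}{n}\left(1 + \frac{\langle x\rangle^2}{s_x^2}\right), \\
\mathrm{Var}[\hat b] &= \sigma^2\,\frac{n}{D}
  = \sigma^2\,\frac{n}{n^2 s_x^2}
  = \frac{\sigma^2}{n\, s_x^2}.
\end{aligned}
```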




When $\sigma^2$ is unknown, Eq. (11.35) does not provide an estimate of the 
uncertainties on the parameters. However, the problem can be solved using the 
experimental data. If we rewrite Eq. (11.18) as $Z = Y - (a + bx)$, since $\sigma^2 = \mathrm{Var}[Z]$, 
its estimate can be obtained from the sample variance of the residuals of the least 
squares line: 

$$\hat z_i = y_i - (\hat a + \hat b x_i) = y_i - \hat y_i , \qquad (11.36)$$

with $\langle \hat z \rangle = 0$ because, from Eq. (11.29), $\langle y \rangle = \langle \hat y \rangle$. Therefore, the sample variance 
of the residuals is: 

$$s_z^2 = \frac{1}{n-2} \sum_{i=1}^{n} \hat z_i^2 = \frac{1}{n-2} \sum_{i=1}^{n} \left[ y_i - \hat a - \hat b x_i \right]^2 , \qquad (11.37)$$

where the variance estimator is unbiased due to the $(n - 2)$ denominator. The 
estimated variances of $\hat a$ and $\hat b$ are then: 

$$s^2(\hat a) = \frac{s_z^2}{n} \left( 1 + \frac{\langle x \rangle^2}{s_x^2} \right) , \qquad (11.38)$$

$$s^2(\hat b) = \frac{s_z^2}{n\, s_x^2} . \qquad (11.39)$$

Now we have all the elements needed to compute the confidence intervals for 
the parameters of the regression line, when the $Y_i$ are assumed to be independent 
Gaussians with mean $(a + b x_i)$ and variance $\sigma^2$. Let us start from the fact that if 
we knew that $b = 0$, the least squares problem would coincide with that of 
estimating the mean from the sample $(y_1, \ldots, y_n)$. This has been solved in Sect. 6.11, showing 
that $M$ (i.e. $\hat a$) and $S^2/\sigma^2$ are independent and follow a Gaussian and a reduced $\chi^2$ 
(with $n - 1$ degrees of freedom) distribution, respectively. So $\sqrt{n}\,(M - \mu)/S$ has 
Student's $t$ distribution with $n - 1$ degrees of freedom. A similar result still holds 
when we also have to estimate $b$, since both $\hat a$ and $\hat b$ are linear combinations of 
independent Gaussian variables, while $s^2(\hat a)/\sigma^2(\hat a)$ and $s^2(\hat b)/\sigma^2(\hat b)$ have a reduced 
$\chi^2$ distribution with $n - 2$ degrees of freedom. Therefore, both $(\hat a - a)/s(\hat a)$ 
and $(\hat b - b)/s(\hat b)$ have Student's $t$-distribution with $n - 2$ degrees of freedom. 
Using the corresponding $t_{1-\alpha/2}$ quantiles, the confidence intervals for $a$ and $b$ with 
CL $= 1 - \alpha$ are then given by: 

$$\hat a \pm t_{1-\alpha/2}\, s(\hat a) , \qquad \hat b \pm t_{1-\alpha/2}\, s(\hat b) . \qquad (11.40)$$
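
The chain of formulae from Eq. (11.34) to Eq. (11.40) can be followed step by step in R. The sketch below uses invented data and an assumed level $\alpha = 0.05$; the result can be compared with the output of confint applied to an lm fit.

```r
# Sketch: standard errors and confidence intervals of the least squares
# line, Eqs. (11.34)-(11.40). Data and alpha are illustrative assumptions.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(1.3, 2.1, 2.8, 4.3, 4.9, 6.2, 6.8, 8.1)
n <- length(x)

mx  <- mean(x)
sx2 <- mean(x^2) - mx^2                       # Eq. (11.34), denominator n
b_hat <- (mean(x * y) - mx * mean(y)) / sx2   # slope
a_hat <- mean(y) - b_hat * mx                 # intercept

z_hat <- y - (a_hat + b_hat * x)              # residuals, Eq. (11.36)
sz2   <- sum(z_hat^2) / (n - 2)               # Eq. (11.37)

s_a <- sqrt(sz2 / n * (1 + mx^2 / sx2))       # Eq. (11.38)
s_b <- sqrt(sz2 / (n * sx2))                  # Eq. (11.39)

alpha <- 0.05
tq    <- qt(1 - alpha / 2, df = n - 2)        # Student quantile, n - 2 d.o.f.
ci_a  <- a_hat + c(-1, 1) * tq * s_a          # Eq. (11.40)
ci_b  <- b_hat + c(-1, 1) * tq * s_b
```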
Once the model is estimated, it can be used to evaluate $\langle Y|x \rangle$ at a point $x$ also 
not included among the experimental points $(x_1, \ldots, x_n)$ or to predict the value we 
would observe for $Y$ in correspondence of $x$. 




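To predict $\langle Y|x \rangle$ it is natural to use the estimate $\hat y(x) = \hat a + \hat b x$. Its error depends 
on the covariance between $\hat a$ and $\hat b$. In fact, applying Eqs. (5.65) and (11.32, 11.33), 
we obtain: 

$$s^2(\hat y(x)) = \left( \frac{\partial \hat y}{\partial \hat a} \right)^2 s^2(\hat a) + \left( \frac{\partial \hat y}{\partial \hat b} \right)^2 s^2(\hat b) + 2\, \frac{\partial \hat y}{\partial \hat a} \frac{\partial \hat y}{\partial \hat b}\, s(\hat a, \hat b) , \qquad (11.41)$$

where $s(\hat a, \hat b)$ is the covariance estimate. To estimate this parameter, with the usual 
notation $\Delta X = X - \langle X \rangle$, we apply Eq. (5.83) and expand to the first order $\hat a$ and $\hat b$ 
in the variables $y_i$, to first calculate the covariance $\mathrm{Cov}[\hat a, \hat b]$: 

$$\mathrm{Cov}[\hat a, \hat b] = \langle \Delta \hat a\, \Delta \hat b \rangle = \sum_i \frac{\partial \hat a}{\partial y_i} \frac{\partial \hat b}{\partial y_i} \langle (\Delta y_i)^2 \rangle = \sum_i \frac{\partial \hat a}{\partial y_i} \frac{\partial \hat b}{\partial y_i}\, \mathrm{Var}[Y_i] ,$$

where the mean value is not applied to the derivatives because they are constant in 
the measured values $y_i$. Recalling Eq. (11.31) and with the notation of Eq. (11.21), 
one finally obtains: 

$$\begin{aligned} \mathrm{Cov}[\hat a, \hat b] &= \frac{\sigma^2}{D^2} \sum_i (S_{xx} - S_x x_i)(n x_i - S_x) \\ &= \frac{\sigma^2}{D^2} \left[ n S_{xx} S_x - n S_{xx} S_x - n S_x S_{xx} + S_x^2 S_x \right] \\ &= -\sigma^2 S_x\, \frac{n S_{xx} - S_x^2}{D^2} = -\sigma^2\, \frac{S_x}{D} , \end{aligned} \qquad (11.42)$$

and then, using Eq. (11.34), one gets the estimate: 

$$s(\hat a, \hat b) = -s_z^2\, \frac{S_x}{D} = -\frac{s_z^2}{n} \frac{\langle x \rangle}{s_x^2} .$$

Substituting this value into Eq. (11.41), and using Eqs. (11.38, 11.39), we finally 
obtain: 

$$s^2(\hat y) = \frac{s_z^2}{n} \left[ 1 + \frac{(x - \langle x \rangle)^2}{s_x^2} \right] . \qquad (11.43)$$

Since $\hat y(x)$ is a linear combination of independent Gaussian variables, it is again 
possible to prove that 

$$\frac{\hat y(x) - (a + bx)}{s(\hat y(x))}$$

is Student's variable with $n - 2$ degrees of freedom. Hence, a confidence interval 
for $y(x) = a + bx$ with CL $= 1 - \alpha$ is given by: 

$$y(x) \in \hat y(x) \pm t_{1-\alpha/2}\, s(\hat y(x)) . \qquad (11.44)$$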


Equation (11.43) is quite instructive, because it clearly shows the risk inherent 
in extrapolations: the expected error increases roughly as $|x - \langle x \rangle|$ as $x$ moves 
away from the “center of gravity” of the experimental points, represented by $\langle x \rangle$. 
This is true even assuming that outside the experimental spectrum the relationship 
between $x$ and $\langle Y|x \rangle$ is still linear, which we can neither affirm nor deny based only on 
the observed sample. 

If linearity is not maintained, Eq. (11.43) is not even an appropriate estimate of 
variance. This remark is confirmed by evaluating the intervals (11.44) as a function 
of $x$ and using $\hat y(x)$ as ordinate. The result is a band around the least squares line as 
shown in Fig. 11.5. Remember, however, that this is not a simultaneous confidence 
set with CL $= 1 - \alpha$, but only the combination of different univariate intervals. 

Fig. 11.5 Confidence belt $\hat y \pm t_{1-\alpha/2}\, s(\hat y)$ of the least squares line. $s(\hat y)$ is estimated 
with Eq. (11.43) for $\langle x \rangle = \langle y \rangle = 1$, $s_x = 1$ and $s_z/\sqrt{n} = 0.5$ 

A new observation $Y$ for a given value $X = x$ still has the expected value $a + bx$; 
we will therefore use $\hat y(x)$ as an estimator of $Y$. However, the variance of the random 
part of $Y$ must also be taken into account in the prediction, as well as that of $\hat y(x)$. 
Therefore, the error associated with the estimate is now given by: 

$$\mathrm{Var}[\hat y(x) + Z|x] = \mathrm{Var}[\hat y(x)] + \mathrm{Var}[Z|x] = \mathrm{Var}[\hat y(x)] + \sigma^2 , \qquad (11.45)$$

where the variances are added together because the new random part of $Y$ is 
independent of the fluctuations of $\hat a$ and $\hat b$. 

Substituting the estimator of Eq. (11.43) into Eq. (11.45) and replacing $\sigma^2$ 
by $s_z^2$, we obtain the error as: 

$$s^2(\hat a + \hat b x + Z) = s_z^2 + s^2(\hat y) = s_z^2 \left[ 1 + \frac{1}{n} \left( 1 + \frac{(x - \langle x \rangle)^2}{s_x^2} \right) \right] . \qquad (11.46)$$


In summary, the prediction interval of $Y$ for a fixed $x$ value and for a given 
confidence level CL $= 1 - \alpha$ becomes: 

$$Y(x) \in \hat a + \hat b x \pm t_{1-\alpha/2}\, s_z \sqrt{1 + \frac{1}{n} \left( 1 + \frac{(x - \langle x \rangle)^2}{s_x^2} \right)} , \qquad (11.47)$$

where, as for the $a + bx$ estimate, $t_{1-\alpha/2}$ is Student's quantile with $n - 2$ degrees of 
freedom. 
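
The difference between the confidence interval (11.44) for the mean response and the prediction interval (11.47) for a new observation is easy to see numerically. This sketch uses invented data, an assumed point x0 outside the sampled range and $\alpha = 0.05$; the coefficients are taken from lm only to keep the code short.

```r
# Sketch: confidence interval, Eq. (11.44), and prediction interval,
# Eq. (11.47), at a new point x0. Data, x0 and alpha are assumptions.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(1.3, 2.1, 2.8, 4.3, 4.9, 6.2, 6.8, 8.1)
n <- length(x)

m     <- lm(y ~ x)
a_hat <- coef(m)[[1]]
b_hat <- coef(m)[[2]]
s_z   <- sqrt(sum(residuals(m)^2) / (n - 2))       # Eq. (11.37)
mx    <- mean(x)
sx2   <- mean((x - mx)^2)                          # Eq. (11.34)

x0 <- 10                                           # extrapolation point
y0 <- a_hat + b_hat * x0
tq <- qt(1 - 0.05 / 2, df = n - 2)

s_mean <- s_z * sqrt((1 + (x0 - mx)^2 / sx2) / n)      # Eq. (11.43)
s_pred <- s_z * sqrt(1 + (1 + (x0 - mx)^2 / sx2) / n)  # Eq. (11.46)

conf_int <- y0 + c(-1, 1) * tq * s_mean            # Eq. (11.44)
pred_int <- y0 + c(-1, 1) * tq * s_pred            # Eq. (11.47)
```

The prediction interval is always wider, because it also contains the variance of the random part Z of the new observation.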


Exercise 11.1 
Consider again Exercise 6.10, and recalculate the interval estimate for the 
thoracic perimeter of a 170-cm-tall soldier. 

Answer The solution previously found did not take into account the uncertainty 
due to the use of sample means, variances and correlation coefficient 
instead of the true ones. Now, if we recall the result of Exercise 6.10: 

$$t \in m(t|s) \pm s(t|s) = 86.9 \pm 4.3 \ \text{cm} ,$$

and apply Eq. (11.47), we can include this uncertainty to get the correct 
estimate: 

$$t \in 86.9 \pm 4.3 \sqrt{1 + \frac{1}{1665} \left( 1 + \frac{(170 - \ldots)^2}{5.79^2} \right)} = 86.9 \pm 4.3 \cdot 1.0006 \simeq 86.9 \pm 4.3 .$$

The previous result, even after the correction, remains virtually unchanged. 
The confidence levels are calculated with Student's density with $(1665 - 2)$ 
degrees of freedom, which can be safely considered as Gaussian. Therefore, 
68% of 170-cm-tall soldiers have a thoracic perimeter between 82.6 and 
91.2 cm. The confidence interval (11.44) containing the true mean (in the 
frequentist sense) with CL $\simeq$ 68% for $x = 170$ cm is: 

$$86.9 \pm 0.1 \ \text{cm} .$$


11.5 Unweighted Linear Least Squares 


Here we generalize the discussion of the previous section to the multidimensional 
case. 

In the most general linear minimization problem, we have $p$ predictors collected 
inside a vector $\boldsymbol{x}$. The equation corresponding to (11.18) can then be written as: 

$$Y = \mu(\boldsymbol{x}, \theta) + Z \equiv f(\boldsymbol{x}, \theta) + Z = \theta_0 + \sum_{j=1}^{p} \theta_j x_j + Z , \qquad (11.48)$$

where $x_j$ is the $j$th element of $\boldsymbol{x}$. As in Sect. 11.4, we set $\langle Z|\boldsymbol{x} \rangle = 0$ and $\mathrm{Var}[Z|\boldsymbol{x}] = 
\sigma^2$. Therefore: 

$$\langle Y|\boldsymbol{x} \rangle = \theta_0 + \sum_{j=1}^{p} \theta_j x_j , \qquad \mathrm{Var}[Y|\boldsymbol{x}] = \sigma^2 .$$

The components of $\boldsymbol{x}$ can be different variables or functions of the same variable, or 
a combination of the two, depending on the problem. For example, we could have a 
response which depends on a single predictor through a polynomial of degree $p$: 

$$f(x, \theta) = \theta_0 + \sum_{j=1}^{p} \theta_j x^j , \qquad (11.49)$$

with $\boldsymbol{x}^\dagger = (x, x^2, \ldots, x^p)$ or, more generally: 

$$f(x, \theta) = \theta_0 + \sum_{j=1}^{p} \theta_j f_j(x) , \qquad (11.50)$$

with $\boldsymbol{x}^\dagger = (f_1(x), f_2(x), \ldots, f_p(x))$. If $p = 1$ in Eq. (11.49), we are again in the 
particular case of the least squares line. The goal is still to get an estimate of $\theta$ from 
a random sample $(\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_n, y_n)$, with $(p + 1) < n$. 

In a multidimensional problem, it is convenient to switch to matrix notation. To 
facilitate the reading, we briefly recall some matrix calculus rules considering two 
generic matrices $A$ and $B$. Denoting with $\dagger$ the matrix transposition, that is, the 
exchange between rows and columns, the following properties hold: 

$$(AB)^\dagger = B^\dagger A^\dagger , \qquad (A + B)^\dagger = A^\dagger + B^\dagger , \qquad (A^\dagger)^\dagger = A , \qquad (11.51)$$

if the matrix dimensions are compatible. If $A$ is a square matrix, its inverse $A^{-1}$ has 
the property $A A^{-1} = I$, where $I$ is the identity or unit matrix. If the inverse matrix 
exists, $A$ is said to have rank equal to the number of its rows or “full rank”. If $B$ is 
also an invertible square matrix, the following properties hold: 

$$(AB)^{-1} = B^{-1} A^{-1} , \qquad A^{-1} A = I , \qquad (A^\dagger)^{-1} = (A^{-1})^\dagger . \qquad (11.52)$$

If $A$ is a matrix with $m$ rows and $k$ columns ($k < m$) and of rank $k$, the $k \times k$ matrix: 

$$A^\dagger A = (A^\dagger A)^\dagger \qquad (11.53)$$

is a symmetric positive definite matrix. This property means that, for any vector $c$ 
different from zero and with compatible dimensions, $c^\dagger A^\dagger A c > 0$. In addition to 
these formulae, it is also worthwhile to read again Sect. 4.5. 

Let us now go back to our original problem. Denoting with $x_{ij}$ the value of the
$j$th element of $\boldsymbol{x}_i$, we define the predictor matrix as:

\[
X = \begin{pmatrix}
1 & x_{11} & x_{12} & \dots & x_{1p} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \dots & x_{np}
\end{pmatrix} . \tag{11.54}
\]

We then denote with $\boldsymbol{y}$ the column vector of the responses $(y_1, \dots, y_n)$ and by $\boldsymbol{\theta}$
the column vector of the parameters $(\theta_0, \theta_1, \dots, \theta_p)$. The column filled with unit
values in the $X$ matrix takes into account the constant part of the model, which
should always be included, unless one is sure that $\theta_0 = 0$. In this way we can
rewrite Eq. (11.48) with the compact matrix equation:

\[
\boldsymbol{Y} = X\boldsymbol{\theta} + \boldsymbol{Z} , \tag{11.55}
\]

where $\boldsymbol{Z}$ is the column vector $(Z_1, \dots, Z_n)$ of uncorrelated random variables with
zero mean and constant marginal variances $\mathrm{Var}[Z_i] = \sigma_z^2$.


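The $\chi^2$ to be minimized for the estimate of $\boldsymbol{\theta}$ can be obtained from Eq. (11.1):

\[
\chi^2(\boldsymbol{\theta}) = \frac{1}{\sigma_z^2} \sum_{i=1}^{n} \Big( y_i - \theta_0 - \sum_{j=1}^{p} \theta_j x_{ij} \Big)^2 , \tag{11.56}
\]

where the point of minimum depends on the numerator only. This relation, written
in a matrix form, becomes:

\[
\sigma_z^2 \chi^2 = (X\boldsymbol{\theta} - \boldsymbol{y})^\dagger (X\boldsymbol{\theta} - \boldsymbol{y}) . \tag{11.57}
\]

The condition for the minimum of Eq. (11.56):

\[
\frac{\partial \chi^2}{\partial \theta_k} = 0 , \qquad k = 0, 1, \dots, p ,
\]

leads to the so-called normal equations:

\[
\sum_{i=1}^{n} \Big[ \Big( y_i - \theta_0 - \sum_{j=1}^{p} \theta_j x_{ij} \Big) x_{ik} \Big] = 0 , \qquad k = 0, 1, \dots, p , \tag{11.58}
\]

where the equality $x_{i0} = 1$ must be used when $k = 0$. The minimum given by the
normal equations (11.58) can be written in the compact form:

\[
X^\dagger \boldsymbol{y} = (X^\dagger X)\,\boldsymbol{\theta} \quad \text{or} \quad \boldsymbol{\beta} = \alpha\,\boldsymbol{\theta} , \tag{11.59}
\]

where the matrices $\boldsymbol{\beta} = X^\dagger \boldsymbol{y}$ and $\alpha = X^\dagger X$ have been introduced following
a rather common notation [BR92, PFTW92]. This equation can be solved for
the unknown parameters $\boldsymbol{\theta}$ by inverting the matrix $X^\dagger X$, which has rank $(p + 1)$;
this gives the LS parameter estimates. This result finally represents the solution of the
linear unweighted least squares with multiple predictors (also known as multiple
regression problem):

\[
\hat{\boldsymbol{\theta}} = (X^\dagger X)^{-1} X^\dagger \boldsymbol{y} = \alpha^{-1} \boldsymbol{\beta} . \tag{11.60}
\]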


These fundamental equations are encoded within the R library by the lm function,
which is used in our Linfit routine. When $\langle Y|x\rangle = a + bx$, it is easy to verify
that the normal Eq. (11.59) just corresponds to Eq. (11.24).
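As an illustration, the following is a minimal sketch that solves the normal equations (11.59) directly and compares the result with lm. The simulated data, the seed and all variable names are our own assumptions; this is not the Linfit code.

```r
# Sketch: solve (X'X) theta = X'y, Eqs. (11.59)-(11.60), and compare with lm().
# All data below are simulated for illustration only.
set.seed(1)
n  <- 50
x1 <- runif(n)                       # two hypothetical predictors
x2 <- runif(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n, sd = 0.5)

X     <- cbind(1, x1, x2)            # predictor matrix with the unit column, Eq. (11.54)
alpha <- crossprod(X)                # alpha = X'X
beta  <- crossprod(X, y)             # beta  = X'y
theta_hat <- drop(solve(alpha, beta))  # Eq. (11.60), solved without forming the inverse

fit <- lm(y ~ x1 + x2)               # the same fit through R's lm()
max_diff <- max(abs(unname(theta_hat) - unname(coef(fit))))
print(max_diff)                      # difference at rounding-error level
```

Note that lm works internally with a QR decomposition rather than with the explicit inverse of $X^\dagger X$, so the two results agree only up to rounding errors.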

We now calculate the errors on these estimates using the matrix formalism and
bearing in mind that Eq. (11.60) is a particular case of Eq. (5.75). Then, we can
write Eq. (11.60) as:

\[
\hat{\theta}_j = f_j(y_1, \dots, y_n) , \qquad j = 0, \dots, p ,
\]

where any function $f_j$ is a linear combination of the elements of $\boldsymbol{y}$. We apply the
variance transformation of Eq. (5.77), where in this operation the $ji$th element of
the transport matrix, given by $\partial f_j / \partial y_i$, is nothing else than the $ji$th element of the
matrix $(X^\dagger X)^{-1} X^\dagger$. Applying this result to Eq. (11.60), the covariance matrix of $\hat{\boldsymbol{\theta}}$
is obtained as:

\[
\begin{aligned}
V(\hat{\boldsymbol{\theta}}) &= (X^\dagger X)^{-1} X^\dagger \, V(\boldsymbol{Y}) \, X \, [(X^\dagger X)^{-1}]^\dagger \\
&= \sigma_z^2 \, (X^\dagger X)^{-1} X^\dagger X \, (X^\dagger X)^{-1} \\
&= \sigma_z^2 \, (X^\dagger X)^{-1} ,
\end{aligned} \tag{11.61}
\]

where Eqs. (11.51) and (11.52) have been used, and the symmetry of $(X^\dagger X)$ has been
considered together with the fact that $V(\boldsymbol{Y}) = \sigma_z^2 I$ is a diagonal matrix with all the
elements equal to $\sigma_z^2$. This important relation shows that the inverse of the matrix
$\alpha = X^\dagger X$ contains all information about the errors of the parameter estimates.
Indeed this matrix, called error matrix, is a square matrix of dimension $(p+1) \times (p+1)$,
given by the number of free parameters, and is symmetric and positive definite.
The diagonal elements are the variances of the LS estimates of the parameters $\boldsymbol{\theta}$, and
the non-diagonal elements represent the covariance between all pairs of estimates
$(\hat{\theta}_i, \hat{\theta}_j)$.

An instructive verification of this statement can be done by applying Eq. (11.61)
to the two-dimensional case (least squares line); in fact, we obtain:

\[
V(\hat{a}, \hat{b}) = \frac{\sigma_z^2}{n \sum_i x_i^2 - \big(\sum_i x_i\big)^2}
\begin{pmatrix}
\sum_i x_i^2 & -\sum_i x_i \\
-\sum_i x_i & n
\end{pmatrix} ,
\]

in agreement with Eqs. (11.24, 11.32, 11.33, and 11.42).

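The evaluation of $\langle Y|\boldsymbol{x}\rangle$ at the generic point $\boldsymbol{x}$ is performed, as in Sect. 11.4, via
the linear transformation $\hat{y}(\boldsymbol{x}) = \boldsymbol{x}^\dagger \hat{\boldsymbol{\theta}}$, where we can consider $\boldsymbol{x}^\dagger$ as a row of the
$X$ matrix or a new set of predictor values. Applying again Eqs. (5.75) and (5.77) to
this new particular transformation, we immediately obtain the variance of $\hat{y}(\boldsymbol{x})$:

\[
\mathrm{Var}[\hat{y}(\boldsymbol{x})] = \boldsymbol{x}^\dagger V(\hat{\boldsymbol{\theta}})\, \boldsymbol{x} , \tag{11.62}
\]

which is the generalization of Eq. (11.41).
The correct estimate $s_z^2$ of the error $\sigma_z^2$ can be obtained analogously to Eq. (11.37):

\[
s_z^2 = \frac{(X\hat{\boldsymbol{\theta}} - \boldsymbol{y})^\dagger (X\hat{\boldsymbol{\theta}} - \boldsymbol{y})}{n - p - 1} . \tag{11.63}
\]

The degrees of freedom at the denominator are still given by the number of points $n$
minus the number $(p + 1)$ of estimated parameters.

If the $Y_i$ are independent and Gaussian-distributed variables, we can obtain, for
each $\theta_j$, confidence intervals that are similar to those given by Eq. (11.40):

\[
\theta_j \in \hat{\theta}_j \pm t_{1-\alpha/2}\, s(\hat{\theta}_j) , \tag{11.64}
\]

where $s(\hat{\theta}_j)$ can be derived from Eq. (11.61) by replacing $\sigma_z$ with $s_z$:

\[
s(\hat{\theta}_j) = s_z \sqrt{(X^\dagger X)^{-1}_{jj}} . \tag{11.65}
\]

The confidence interval of $y(\boldsymbol{x}) = \langle Y|\boldsymbol{x}\rangle$ with $CL = 1 - \alpha$ assumes a form
analogous to Eq. (11.44):

\[
y(\boldsymbol{x}) \in \hat{y}(\boldsymbol{x}) \pm t_{1-\alpha/2}\, s_z \sqrt{\boldsymbol{x}^\dagger (X^\dagger X)^{-1} \boldsymbol{x}} , \tag{11.66}
\]

where $t_{1-\alpha/2}$ is Student’s quantile with $(n - p - 1)$ degrees of freedom. The
prediction interval for $Y$ (similar to that of Eq. (11.47)) at a given value $\boldsymbol{x}$ is identical
to that of Eq. (11.66) after adding 1 to the term under square root:

\[
Y(\boldsymbol{x}) \in \hat{y}(\boldsymbol{x}) \pm t_{1-\alpha/2}\, s_z \sqrt{1 + \boldsymbol{x}^\dagger (X^\dagger X)^{-1} \boldsymbol{x}} . \tag{11.67}
\]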
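The chain from Eq. (11.61) to Eq. (11.67) can be followed numerically with a few lines of R. The sketch below uses simulated data and variable names of our own choice (they are assumptions, not part of the book's routines) and compares the hand-made intervals with those returned by confint and predict.

```r
# Sketch: covariance matrix, confidence and prediction intervals, Eqs. (11.61)-(11.67).
# Simulated straight-line data; CL = 95% is an arbitrary choice.
set.seed(2)
n <- 40
x <- seq(0, 10, length.out = n)
y <- 1 + 0.5 * x + rnorm(n, sd = 1)

X  <- cbind(1, x)                              # predictor matrix
p  <- ncol(X) - 1                              # number of predictors
C  <- solve(crossprod(X))                      # (X'X)^{-1}
theta_hat <- drop(C %*% crossprod(X, y))       # Eq. (11.60)
res <- drop(y - X %*% theta_hat)               # residuals
s2  <- sum(res^2) / (n - p - 1)                # Eq. (11.63)
V   <- s2 * C                                  # Eq. (11.61) with sigma_z^2 -> s_z^2
tq  <- qt(1 - 0.05 / 2, df = n - p - 1)        # Student's quantile

# Eqs. (11.64)-(11.65): intervals for the parameters
ci_theta <- cbind(theta_hat - tq * sqrt(diag(V)),
                  theta_hat + tq * sqrt(diag(V)))

# Eqs. (11.66)-(11.67) at a new point (first element is the unit term)
x0 <- c(1, 4.2)
y0 <- sum(x0 * theta_hat)
q  <- drop(t(x0) %*% C %*% x0)
ci_mean <- y0 + c(-1, 1) * tq * sqrt(s2 * q)         # confidence interval
pi_new  <- y0 + c(-1, 1) * tq * sqrt(s2 * (1 + q))   # prediction interval

# The same quantities from R's own machinery
fit <- lm(y ~ x)
new <- data.frame(x = 4.2)
ref_ci <- predict(fit, new, interval = "confidence")[1, c("lwr", "upr")]
ref_pi <- predict(fit, new, interval = "prediction")[1, c("lwr", "upr")]
```

The prediction interval is always wider than the confidence interval, because of the additional unit term under the square root of Eq. (11.67).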


Often the fitting procedure is complicated and difficult to interpret due to the
correlations between the LS estimates of the parameters, whose values change from
one fit to another if we increase $p$ in Eq. (11.48) by adding additional predictors to
the ones used in the previous fit. To obtain uncorrelated estimates, the $(X^\dagger X)^{-1}$
matrix must be diagonalized using sophisticated matrix calculus techniques or by
orthogonalizing the $X$ matrix to satisfy the condition:

\[
\sum_i x_{ik}\, x_{il} = 0 \quad \text{if } k \neq l . \tag{11.68}
\]

In this way $X^\dagger X$ becomes diagonal, and also the error matrix, which is its inverse,
is diagonal. Then, the covariances are all zero and the parameters are uncorrelated.
Although important in practice, we will not discuss diagonalization methods here,
since they are quite laborious. Interested readers can easily find them in texts
devoted to numerical computation techniques, such as [PFTW92].
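Without entering into the numerical methods, the effect of condition (11.68) can be seen with the orthogonal polynomials returned by the standard R function poly. The sketch below, with an arbitrary grid of x values chosen by us, only checks that $X^\dagger X$ becomes diagonal; it is not a diagonalization algorithm.

```r
# Sketch: ordinary powers of x versus orthogonal polynomials, condition (11.68).
x <- seq(1, 8, by = 0.5)                    # arbitrary grid of predictor values

X_raw  <- cbind(1, x, x^2)                  # ordinary powers: correlated columns
X_orth <- cbind(1, poly(x, degree = 2))     # orthogonal polynomials from stats::poly()

G_raw  <- crossprod(X_raw)                  # X'X with ordinary powers
G_orth <- crossprod(X_orth)                 # X'X with orthogonal columns

# Largest off-diagonal element in absolute value
max_off_raw  <- max(abs(G_raw  - diag(diag(G_raw))))
max_off_orth <- max(abs(G_orth - diag(diag(G_orth))))
print(c(raw = max_off_raw, orthogonal = max_off_orth))
```

With the orthogonal columns the error matrix is diagonal, so the estimate of a low-order coefficient does not change when a higher-order term is added to the fit.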


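Exercise 11.2
Write the normal equations for the quadratic function:

\[
Y = f(X; a, b, c) = a + bX + cX^2 .
\]

Answer Using the notation of Eq. (11.21), Eq. (11.59) becomes:

\[
\begin{pmatrix}
n & S_x & S_{x^2} \\
S_x & S_{x^2} & S_{x^3} \\
S_{x^2} & S_{x^3} & S_{x^4}
\end{pmatrix}
\begin{pmatrix} a \\ b \\ c \end{pmatrix}
=
\begin{pmatrix} S_y \\ S_{xy} \\ S_{x^2 y} \end{pmatrix} .
\]

The comparison of these equations with Eq. (11.24) immediately suggests the
generalization of the normal equations to a polynomial of any degree.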


11.6 Weighted Linear Least Squares 


In Sects. 11.4 and 11.5, we assumed the different observations of the response
variable as uncorrelated and with constant variance, even if unknown. A first
deviation from this hypothesis is to assume non-constant variances, i.e.:

\[
V(\boldsymbol{Y}) = \begin{pmatrix}
\sigma_1^2 & 0 & \dots & 0 \\
0 & \sigma_2^2 & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \sigma_n^2
\end{pmatrix} , \tag{11.69}
\]

that is, $\mathrm{Var}[Y_i] = \sigma_i^2 \ \forall i$ and $\mathrm{Cov}[Y_i, Y_j] = 0$ for $i \neq j$. More generally, the
covariances may also be non-zero, with a non-diagonal $V$ matrix.

If all the elements of $V$ were unknown, the estimation problem would be even
more difficult. Therefore, here we analyse the situations where all $\sigma_i$ are known,
showing also that a solution is feasible when the ratios $\sigma_i/\sigma_j$ are known, i.e.
when we can quantify how much the $i$th response is more (or less) variable than
the $j$th one. In this last case, we can write the covariance matrix as $V = \sigma_z^2 W^{-1}$,
where:

\[
W^{-1} = \begin{pmatrix}
\frac{1}{w_1} & 0 & \dots & 0 \\
0 & \frac{1}{w_2} & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \frac{1}{w_n}
\end{pmatrix} , \tag{11.70}
\]

and $\sigma_z$ is a common error scale factor to be estimated from the data (see Eq. (11.79)).
The weights $(w_1, \dots, w_n)$ are all known; then $\sigma_i^2 = \sigma_z^2 / w_i$ and also the ratios
$\sigma_i^2/\sigma_j^2$ are known and equal to $w_j/w_i$.
When all $\sigma_i$ are known (absolute weights), $V = W^{-1}$ and the weight matrix
can be formally written as in the previous case with:

\[
w_i = 1/\sigma_i^2 . \tag{11.71}
\]
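The $\chi^2$ to be minimized for the multiple regression with $\boldsymbol{x}_i^\dagger = (1, x_{i1}, \dots, x_{ip})$ now
becomes:

\[
\chi^2(\boldsymbol{\theta}) = \sum_{i=1}^{n} \frac{[y_i - \boldsymbol{x}_i^\dagger \boldsymbol{\theta}]^2}{\sigma_i^2}
= \sum_{i=1}^{n} \big[\sqrt{w_i}\, y_i - \sqrt{w_i}\, \boldsymbol{x}_i^\dagger \boldsymbol{\theta}\big]^2
= \sum_{i=1}^{n} \big[\tilde{y}_i - \tilde{\boldsymbol{x}}_i^\dagger \boldsymbol{\theta}\big]^2 . \tag{11.72}
\]

Instead, in the case of relative weights, the $\chi^2$ to be minimized takes the form:

\[
\chi^2(\boldsymbol{\theta}) = \sum_{i=1}^{n} \frac{[y_i - \boldsymbol{x}_i^\dagger \boldsymbol{\theta}]^2}{\sigma_i^2}
= \frac{1}{\sigma_z^2} \sum_{i=1}^{n} \big[\sqrt{w_i}\, y_i - \sqrt{w_i}\, \boldsymbol{x}_i^\dagger \boldsymbol{\theta}\big]^2
= \frac{1}{\sigma_z^2} \sum_{i=1}^{n} \big[\tilde{y}_i - \tilde{\boldsymbol{x}}_i^\dagger \boldsymbol{\theta}\big]^2 . \tag{11.73}
\]

In both situations, we set $\tilde{y}_i = \sqrt{w_i}\, y_i$ and $\tilde{\boldsymbol{x}}_i = \sqrt{w_i}\, \boldsymbol{x}_i$. Passing to matrix notation,
we denote by $W^{1/2}$ the diagonal matrix of the weight square roots and set $\tilde{\boldsymbol{y}} = W^{1/2} \boldsymbol{y}$
and $\tilde{X} = W^{1/2} X$. With the transformed variables, the linear model we are considering
is:

\[
\tilde{\boldsymbol{Y}} = \tilde{X} \boldsymbol{\theta} + \boldsymbol{Z} , \tag{11.74}
\]

where $\boldsymbol{Z}$ is the same as in model (11.55). It immediately turns out that minimizing the
$\chi^2$ of Eq. (11.72) is equivalent to minimizing:

\[
\chi^2 = (\tilde{X} \boldsymbol{\theta} - \tilde{\boldsymbol{y}})^\dagger (\tilde{X} \boldsymbol{\theta} - \tilde{\boldsymbol{y}}) . \tag{11.75}
\]

This $\chi^2$ has the same form given in Eq. (11.57), so that, taking into account that
$W^{1/2} W^{1/2} = W$, the symmetry of $W$ and Eqs. (11.51), we have:

\[
\hat{\boldsymbol{\theta}} = (\tilde{X}^\dagger \tilde{X})^{-1} \tilde{X}^\dagger \tilde{\boldsymbol{y}} = (X^\dagger W X)^{-1} X^\dagger W \boldsymbol{y} \equiv \alpha^{-1} \boldsymbol{\beta} , \tag{11.76}
\]

where the matrices defined in Eq. (11.59) now become $\alpha = X^\dagger W X$ and $\boldsymbol{\beta} = X^\dagger W \boldsymbol{y}$.

The errors of the estimates are easily evaluated from Eq. (11.61):

\[
V(\hat{\boldsymbol{\theta}}) = (\tilde{X}^\dagger \tilde{X})^{-1} = (X^\dagger W X)^{-1} \quad \text{absolute weights;} \tag{11.77}
\]

\[
V(\hat{\boldsymbol{\theta}}) = \sigma_z^2 (\tilde{X}^\dagger \tilde{X})^{-1} = \sigma_z^2 (X^\dagger W X)^{-1} \quad \text{relative weights.} \tag{11.78}
\]

In this last case, recalling Eq. (11.63), the estimate of $\sigma_z^2$ becomes:

\[
s_z^2 = \frac{(\tilde{X} \hat{\boldsymbol{\theta}} - \tilde{\boldsymbol{y}})^\dagger (\tilde{X} \hat{\boldsymbol{\theta}} - \tilde{\boldsymbol{y}})}{n - p - 1}
= \frac{(X \hat{\boldsymbol{\theta}} - \boldsymbol{y})^\dagger W (X \hat{\boldsymbol{\theta}} - \boldsymbol{y})}{n - p - 1} . \tag{11.79}
\]

If the responses $Y_i$ are independent Gaussians, the confidence intervals for the $\theta_j$ can
be easily evaluated by applying Eq. (11.64), with the error of Eq. (11.65) replaced
by:

\[
\sqrt{(X^\dagger W X)^{-1}_{jj}} \equiv \sqrt{\alpha^{-1}_{jj}} \quad \text{absolute weights;} \tag{11.80}
\]

\[
s_z \sqrt{(X^\dagger W X)^{-1}_{jj}} \equiv s_z \sqrt{\alpha^{-1}_{jj}} \quad \text{relative weights.} \tag{11.81}
\]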


All the output results of the Linfit routine are obtained with the weighted least 
squares Eqs. (11.72-11.79). 
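The weighted formulae can be checked with a short R sketch. The data and the error model below are simulated and are our own assumptions; the comparison uses the standard lm function with its weights argument, not the Linfit routine. Note that lm always treats the weights as relative, so its covariance matrix corresponds to Eq. (11.78).

```r
# Sketch: weighted least squares, Eqs. (11.76)-(11.79), compared with lm(weights = ).
set.seed(3)
n <- 30
x <- seq(1, 10, length.out = n)
sigma <- 0.1 * (1 + x)                      # hypothetical known errors
y <- 2 + 0.7 * x + rnorm(n, sd = sigma)
w <- 1 / sigma^2                            # weights as in Eq. (11.71)

X <- cbind(1, x)
W <- diag(w)                                # diagonal weight matrix
alpha <- t(X) %*% W %*% X                   # alpha = X'WX
beta  <- t(X) %*% W %*% y                   # beta  = X'Wy
theta_hat <- drop(solve(alpha, beta))       # Eq. (11.76)

V_abs <- solve(alpha)                       # Eq. (11.77), absolute weights
res   <- drop(y - X %*% theta_hat)
s2    <- sum(w * res^2) / (n - ncol(X))     # Eq. (11.79)
V_rel <- s2 * V_abs                         # Eq. (11.78), relative weights

fit <- lm(y ~ x, weights = w)               # R's weighted fit
print(rbind(manual = theta_hat, lm = coef(fit)))
```

When the weights are absolute, the errors of Eq. (11.80) are obtained from the lm output by dividing its covariance matrix by the squared residual standard error.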

The “prediction” $\tilde{y}(\tilde{\boldsymbol{x}})$ for a given $\tilde{\boldsymbol{x}}$ value is generally not of much interest;
therefore it is not worth applying Eq. (11.66) to $\tilde{y}$ directly. We just remark that
$\hat{y}(\boldsymbol{x}) = \boldsymbol{x}^\dagger \hat{\boldsymbol{\theta}}$ still holds for the untransformed response variable and thus the
variance estimate of $\hat{y}(\boldsymbol{x})$, in the case of absolute weights, is $\boldsymbol{x}^\dagger (X^\dagger W X)^{-1} \boldsymbol{x}$, as
in Eq. (11.62). With relative weights we have instead $(s_z^2)\, \boldsymbol{x}^\dagger (X^\dagger W X)^{-1} \boldsymbol{x}$. The
confidence interval with $CL = 1 - \alpha$ is then as that of Eq. (11.66):

\[
y(\boldsymbol{x}) \in \hat{y}(\boldsymbol{x}) \pm t_{1-\alpha/2}\, (s_z) \sqrt{\boldsymbol{x}^\dagger (X^\dagger W X)^{-1} \boldsymbol{x}} , \tag{11.82}
\]

where the term $(s_z)$ is included only in the case of relative weights. The prediction
interval for $Y(\boldsymbol{x})$ is obtained from this equation by adding to the variance estimate of $\hat{y}(\boldsymbol{x})$
that of the random part of the model for $Y$ for a given $\boldsymbol{x}$:

\[
Y(\boldsymbol{x}) = \boldsymbol{x}^\dagger \boldsymbol{\theta} + \frac{Z}{\sqrt{w}} ,
\]

that is, $\mathrm{Var}[Z/\sqrt{w}] = 1/w$ for absolute weights and $\mathrm{Var}[Z/\sqrt{w}] = \sigma_z^2/w$ or its
estimate $s_z^2/w$ for relative ones. Finally, as in Eq. (11.67):

\[
Y(\boldsymbol{x}) \in \hat{y}(\boldsymbol{x}) \pm t_{1-\alpha/2}\, (s_z) \sqrt{\frac{1}{w} + \boldsymbol{x}^\dagger (X^\dagger W X)^{-1} \boldsymbol{x}} , \tag{11.83}
\]

where again the term within brackets $(s_z)$ must be included only for relative weights.
The previous discussion is valid if $W$ is a diagonal matrix. To be more general,
we examine the transformation $\tilde{\boldsymbol{y}} = W^{1/2} \boldsymbol{y}$ together with the obvious factorization
of $V$, that is, $V = \sigma_z^2 W^{-1} = \sigma_z^2 W^{-1/2} W^{-1/2}$. If $V$ is non-diagonal and we set
$V = \sigma_z^2 W^{-1}$, where $W$ is a known matrix, we can obtain a similar factorization by
applying Eq. (4.68): $W^{-1} = H H^\dagger$. Here $H$ plays the same role as $W^{-1/2}$. Based
on the results of Sect. 4.5, we realize that the transformation $\tilde{\boldsymbol{Y}} = H^{-1} \boldsymbol{Y}$ makes
$\tilde{\boldsymbol{Y}}$ a Gaussian vector with covariance matrix equal to $\sigma_z^2 I$ and vector of the means
$\tilde{X} \boldsymbol{\theta} = H^{-1} X \boldsymbol{\theta}$, bringing us back to the hypotheses used in the previous paragraph to
obtain all the confidence intervals. Therefore the results from Eqs. (11.75) to (11.82)
hold, without modifications, for any covariance matrix $\sigma_z^2 W^{-1}$. Also Eq. (11.83)
continues to hold by replacing $1/w$ with the $i$th element on the diagonal of $W^{-1}$, if
we consider an experimental point $\boldsymbol{x}_i$, or with a new coefficient if not.
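The factorization $W^{-1} = H H^\dagger$ can be obtained in R with the Cholesky decomposition (function chol). The sketch below assumes, purely for illustration, a known correlation structure decreasing with the distance between observations; the matrix, the seed and the names are our own choices.

```r
# Sketch: non-diagonal covariance V = sigma_z^2 W^{-1} handled through W^{-1} = H H'.
set.seed(4)
n <- 20
x <- seq(0, 1, length.out = n)
X <- cbind(1, x)

# Hypothetical known matrix W^{-1}: correlations 0.6^|i-j|
Winv <- 0.6^abs(outer(1:n, 1:n, "-"))
H <- t(chol(Winv))                          # chol() gives the upper factor R with R'R = Winv,
                                            # so H = R' is lower triangular and Winv = H H'

# Simulated correlated responses with sigma_z = 0.3
y <- drop(X %*% c(1, 2) + 0.3 * H %*% rnorm(n))

# Transformed variables: y~ = H^{-1} y and X~ = H^{-1} X
y_t <- forwardsolve(H, y)
X_t <- forwardsolve(H, X)
theta_white <- drop(solve(crossprod(X_t), crossprod(X_t, y_t)))   # ordinary LS on (X~, y~)

# Direct use of Eq. (11.76) with the full matrix W
W <- solve(Winv)
theta_gls <- drop(solve(t(X) %*% W %*% X, t(X) %*% W %*% y))
print(rbind(transformed = theta_white, direct = theta_gls))
```

The two estimates coincide up to rounding errors, which is the numerical counterpart of the statement that Eqs. (11.75) to (11.82) hold for any known $W$.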


11.7 Properties of Least Squares Estimates 


We now prove the three fundamental theorems (including that of Gauss-Markov)
which are the basis of the linear least squares estimation. If you do
not appreciate mathematics, you can read only the statements of the theorems (and their
consequences!) and move on to the next section.


Theorem 11.1 (On Correct Estimates) In the case of linear dependence on the 
parameters, the least squares (LS) estimates are unbiased. 


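Proof Applying the operator of the mean to Eq. (11.60) and recalling Eq. (11.55),
one immediately gets:

\[
\langle \hat{\boldsymbol{\theta}} \rangle = (X^\dagger X)^{-1} X^\dagger \langle \boldsymbol{Y} \rangle = (X^\dagger X)^{-1} X^\dagger X \boldsymbol{\theta} = \boldsymbol{\theta} ,
\]

according to Eq. (10.14).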
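Theorem 11.2 (Gauss-Markov) With reference to the model of Eq. (11.55), the
LS estimator has minimal variance (i.e. is the most efficient) among all the unbiased
and linear estimators of $\boldsymbol{\theta}$.

Proof We must show that, if $\boldsymbol{\theta}^*$ is an unbiased linear estimator of the parameters of the
linear model for $\boldsymbol{Y}$, one has:

\[
\boldsymbol{a}^\dagger V(\hat{\boldsymbol{\theta}})\, \boldsymbol{a} \le \boldsymbol{a}^\dagger V(\boldsymbol{\theta}^*)\, \boldsymbol{a} , \tag{11.84}
\]

where $\boldsymbol{a}$ is any vector of dimension $(p + 1)$. In particular, if $\boldsymbol{a}$ contains all zeros and
the value 1 in the $i$th position, Eq. (11.84) includes also the cases:

\[
V(\hat{\theta}_i) \le V(\theta_i^*) , \qquad i = 0, 1, \dots, p . \tag{11.85}
\]

We therefore consider a generic unbiased estimate of the parameters $\boldsymbol{\theta}$ which is
linear in $\boldsymbol{Y}$:

\[
\boldsymbol{\theta}^* = U \boldsymbol{Y} .
\]

For the least squares estimators, $U = (X^\dagger X)^{-1} X^\dagger$. Since an unbiased estimate has
been assumed, from Eq. (11.55) one has:

\[
\langle \boldsymbol{\theta}^* \rangle = U \langle \boldsymbol{Y} \rangle = U X \boldsymbol{\theta} = \boldsymbol{\theta} ,
\]

and hence:

\[
U X = I , \qquad (U X)^\dagger = I . \tag{11.86}
\]

It is crucial to note that this property does not imply $U = X^{-1}$, because $U$ and $X$
are not square matrices. Based on Eq. (11.61), we have:

\[
V(\boldsymbol{\theta}^*) = \sigma_z^2\, U U^\dagger .
\]

The following identity is also valid:

\[
U U^\dagger = C + (U - C X^\dagger)(U - C X^\dagger)^\dagger , \tag{11.87}
\]

where $C$ is the LS error matrix divided by $\sigma_z^2$:

\[
C = (X^\dagger X)^{-1} .
\]

This relation can be easily verified by developing the right term of the previous
equation using Eq. (11.86) and because $C$, $V$ and their inverses are symmetric
matrices coincident with their transpose:

\[
\begin{aligned}
C + (U - C X^\dagger)(U - C X^\dagger)^\dagger &= C + (U - C X^\dagger)(U^\dagger - X C) \\
&= C + U U^\dagger - C X^\dagger U^\dagger - U X C + C (X^\dagger X) C \\
&= C + U U^\dagger - C - C + C (X^\dagger X) C \\
&= C + U U^\dagger - C - C + C = U U^\dagger .
\end{aligned}
\]

From Eq. (11.61) we can write Eq. (11.87) as:

\[
V(\boldsymbol{\theta}^*) = V(\hat{\boldsymbol{\theta}}) + \sigma_z^2 (U - C X^\dagger)(U - C X^\dagger)^\dagger ,
\]

which shows that the covariance matrix of $\boldsymbol{\theta}^*$ is equal to the covariance matrix of the
LS estimates $\hat{\boldsymbol{\theta}}$ plus a symmetric positive semidefinite matrix written in the form
$H H^\dagger$, as in Eq. (4.68). This proves the theorem.

Clearly, the equality $V(\boldsymbol{\theta}^*) = V(\hat{\boldsymbol{\theta}})$ occurs when $U = C X^\dagger = (X^\dagger X)^{-1} X^\dagger$, as
it is easy to verify.

It is important to note that this theorem does not imply any assumptions about
the population distributions or about the size of the sample $\boldsymbol{y}$. The only requirement
is that the average values $\langle \boldsymbol{Y} \rangle$ are linear functions of the parameters.

For the weighted least squares of Sect. 11.6, the theorems just proved continue to
hold by replacing $X$ with $\tilde{X}$ and $\boldsymbol{Y}$ with $\tilde{\boldsymbol{Y}}$.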

We have seen that, in the case of Gaussian variables and linear models, the
LS estimates also provide Gaussian intervals for the parameters $\boldsymbol{\theta}$, for $\langle Y|\boldsymbol{x}\rangle$ and
prediction intervals for $Y$ itself. Moreover, it is possible to perform a $\chi^2$ test of the
fitted model using the statistic $\chi^2(\hat{\boldsymbol{\theta}})$ of Eq. (11.2).

The $\chi^2(\hat{\boldsymbol{\theta}})$ variate has a $\chi^2(n - p - 1)$ distribution, and the model fit is usually
considered unsatisfactory at a level $\alpha$ if:

\[
SL = P\left\{ Q \ge \frac{\chi^2(\hat{\boldsymbol{\theta}})}{n - p - 1} \right\} < \alpha , \tag{11.88}
\]

where $Q$ follows the reduced $\chi^2$ distribution with $(n - p - 1)$ degrees of freedom.

Less used, but safer, is the two-tailed test of Eq. (7.34). Following the discussion
in Chap. 6, we know that this test also protects against the use of models with too
many parameters, which tend to interpolate the data and then produce too small
$\chi^2$ values (overparametrization).

Under suitable conditions an important theorem [SW89] allows us to extend 
these properties even to nonlinear models and non-Gaussian variables. 

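The following discussion applies to $\chi^2$ functions that follow Definition 11.1,
hence excluding the effective variance case. The models considered in the theorem
are in fact of the type:

\[
Y_i = \mu_i(\boldsymbol{x}_i, \boldsymbol{\theta}) + Y_{Ri} , \qquad i = 1, \dots, n , \tag{11.89}
\]

where $\boldsymbol{x}_i$ are fixed and $Y_{Ri}$ are independent variables with null mean and known variances
$\sigma_i^2$. Definition 11.1 follows Eq. (11.89), because the assumption $\langle Y_i|\boldsymbol{x}_i\rangle = \mu_i(\boldsymbol{\theta})$
coincides with Eq. (11.89) and includes the cases of both the non-random and
random predictors conditional on $\boldsymbol{x}$. We assume that $\mu_i(\boldsymbol{\theta}) = f(\boldsymbol{x}_i, \boldsymbol{\theta})$ for all $i$
and use the notations:

\[
\boldsymbol{\mu}(\boldsymbol{\theta}) = (\mu_1(\boldsymbol{\theta}), \dots, \mu_n(\boldsymbol{\theta}))^\dagger , \qquad
F_{ij} = \frac{\partial \mu_i(\boldsymbol{\theta})}{\partial \theta_j} ,
\]

where $F_{ij}$ are the elements of an $F$ matrix of dimension $n \times (p + 1)$ that generalizes
Eq. (11.54) introduced in the linear LS case.

Theorem 11.3 (Least Squares Estimates) Given the model (11.89) with independent
variables $Y_{R1}, \dots, Y_{Rn}$ with zero mean and variances $\sigma_1^2, \dots, \sigma_n^2$, one approximately
has:

\[
\hat{\boldsymbol{\theta}} - \boldsymbol{\theta} \sim N_{p+1}(\boldsymbol{0}, \Sigma^{-1}) , \qquad \Sigma = F^\dagger W F , \tag{11.90}
\]

where $W$ is a diagonal matrix with $1/\sigma_i^2$ at the position $(i,i)$ and $\hat{\boldsymbol{\theta}}$ is the LS
estimator (see [SW89], Sect. 12.2).

If, in addition, $Y_{Ri}$ are Gaussian, $\chi^2(\hat{\boldsymbol{\theta}})$ is the variate of a random variable
asymptotically distributed as $\chi^2(\nu)$, where $\nu = (n - p - 1)$, with $(p + 1)$
corresponding to the $\boldsymbol{\theta}$ dimension (see [SW89], Theorem 2.1).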


Table 11.1 Properties of the LS estimators in the case of known errors. The symbol (*) indicates
that the property is valid under the conditions of Theorem 11.3. The efficiency refers to the Cramér-
Rao lower bound and is always maximal in the first (with finite sample size) and in the second row,
where the estimates are from the ML method, while in the third row, it is instead limited, due to
the Gauss-Markov theorem, to the correct and linear estimators

Problem type               Properties
Gaussian     Linear        Efficient      Gaussian      χ² test
data?        model?        estimator?     estimates?    possible?
YES          YES           YES            YES           YES
YES          NO            YES            YES (*)       YES (*)
NO           YES           YES            YES (*)       NO
NO           NO            UNCERTAIN      YES (*)       NO


Then in practice we can apply Eq. (11.63) and the equations that follow it to the nonlinear case by replacing
everywhere $\boldsymbol{x}^\dagger \hat{\boldsymbol{\theta}}$ with $f(\boldsymbol{x}, \hat{\boldsymbol{\theta}})$ and every row $\boldsymbol{x}_i^\dagger$ of $X$ with the vector of derivatives:

\[
\left. \frac{\partial f(\boldsymbol{x}_i, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}^\dagger} \right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}
\]

when $X$ is not multiplied by $\boldsymbol{\theta}$.

Theorem 11.3 partially justifies the widespread habit of applying the $3\sigma$ law to
the estimation intervals provided by nonlinear LS algorithms and of performing the
$\chi^2$ test to check the best-fit solution (see also Table 11.1). However, in important
cases when any doubts arise, we advise you to simulate the LS procedure with
artificial data to directly verify the distribution of the estimated parameters and of
the $\chi^2$ values using the methods of Sect. 8.10.
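A check of this kind can be sketched in a few lines of R. The example below is only an illustration under our own assumptions: a linear model with non-Gaussian (uniform) errors of known standard deviation, 2000 repetitions and an arbitrary seed. It is not the procedure of Sect. 8.10, and for a nonlinear model the fit inside the loop would be replaced by the corresponding minimization.

```r
# Sketch: simulate the LS procedure to inspect the distribution of an estimate
# and of the chi-square values (linear model, uniform errors, known sigma).
set.seed(8)
n <- 20
x <- seq(1, 10, length.out = n)
sigma <- 0.5                                  # known error (assumed)
X <- cbind(1, x)
C <- solve(crossprod(X))                      # (X'X)^{-1}
theta_true <- c(2, 0.7)                       # hypothetical true parameters

nsim <- 2000
pull <- numeric(nsim)                         # standardized deviation of the slope
chi2 <- numeric(nsim)                         # chi-square at the minimum
for (k in seq_len(nsim)) {
  z  <- sigma * runif(n, -sqrt(3), sqrt(3))   # uniform errors: zero mean, sd = sigma
  y  <- drop(X %*% theta_true) + z
  th <- drop(C %*% crossprod(X, y))           # LS estimate, Eq. (11.60)
  pull[k] <- (th[2] - theta_true[2]) / (sigma * sqrt(C[2, 2]))
  chi2[k] <- sum((y - X %*% th)^2) / sigma^2
}
print(c(mean_pull = mean(pull), sd_pull = sd(pull), mean_chi2 = mean(chi2)))
# hist(pull); hist(chi2)                      # visual comparison with N(0,1) and chi2(n-2)
```

If the approximations hold, the standardized deviations should have mean close to 0 and standard deviation close to 1, and the mean of the $\chi^2$ values should be close to the number of degrees of freedom, here $n - 2 = 18$.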


11.8 Model Testing and Search for Functional Forms 


The results we have presented so far are valid if the model assumed for the expected 
response value is correct. 

Indeed, in these situations, the functional form of the model is known, and hence
the final $\chi^2$ value can be used to readjust the errors of the experimental points and
of the obtained estimates. In this case, it obviously makes no sense to perform the
$\chi^2$ test.

If, on the other hand, one is not sure of the functional form, and, for example, a
model selection is performed by adding or removing parameters from polynomials
or other empirical functions, the model validity can be judged with the $\chi^2$ test only
if the absolute errors are known. If the assumed model is correct and the errors are
Gaussian, we get:

\[
\chi^2(\hat{\boldsymbol{\theta}}) = \sum_{i=1}^{n} \frac{[Y_i - \boldsymbol{x}_i^\dagger \hat{\boldsymbol{\theta}}]^2}{\sigma_i^2} \sim \chi^2(n - p - 1) ,
\]

and the fit quality can be controlled with the two-tailed test described in the previous
section.

As we will show below, in addition to the $\chi^2$ test, it is also possible to perform
the $F$ test and/or the residual trend analysis. The latter two methods can also be
used when only the relative error weights are known, as they are not affected by
the value of a common scale factor. In the general linear model problem, when the
functional form is not known a priori, one starts by estimating a given hypothetical
function belonging to the class defined by Eq. (11.48). Consider, for example, a
useful subclass of regression models where each of the $p$ predictors is a function of
the same variable $x$ (see also Eqs. (11.49) and (11.50)):

\[
f(x, \boldsymbol{\theta}) = \theta_0 + \sum_{k=1}^{p} \theta_k f_k(x) , \qquad
f(x, \boldsymbol{\theta}) = \theta_0 + \sum_{k=1}^{p} \theta_k x^k . \tag{11.91}
\]

As usual, this function has to be fitted to $n$ experimental points. In the following it is
necessary to distinguish carefully the cases where the absolute error is known
or not. The latter situation corresponds to having an unknown $\sigma_z$ value in Eq. (11.73).

Recall that the problem consists in finding the curve around which the fluc- 
tuations of the points are random, that is, the curve representing the functional 
dependence between x and the average value of Y. This curve must not pass exactly 
through points (interpolation) and therefore must have fewer parameters than the 
number of experimental points; for this reason it is also called regression curve 
(although historically this term was introduced in a different context). 

The initial choice of a particular function within the subclass (11.91) is made on 
the basis of available information on the problem and, whenever possible, of the 
graphic patterns of the (x;, y;) pairs, as in Figs.11.1 and 11.2. 

The first test to be performed after the $\chi^2$ minimization is to check that the
residuals:

\[
\hat{z}_i = y_i - \hat{y}(x_i) = y_i - \hat{y}_i \tag{11.92}
\]

behave just like random fluctuations since they are, in fact, an estimate of the random
fluctuations $z_i$. A plot of the $(\hat{y}_i, \hat{z}_i)$ pairs is very informative in this sense, since the
residuals always have zero mean. It can be shown that, if the model includes the
constant term $\theta_0$, the sample correlation between the predicted values $\hat{y}_i$ and the
residuals $\hat{z}_i$ vanishes and there is no linear relation between them.
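These two properties of the residuals are easy to verify numerically. In the following sketch the data are simulated and the fitted model is deliberately not the generating one (both choices are ours): the mean of the residuals and their correlation with the predicted values vanish in any case, so that only the shape of the residual plot can reveal an inadequate functional form.

```r
# Sketch: residuals of Eq. (11.92) for a model that includes the constant term.
set.seed(5)
x <- seq(1, 8, length.out = 25)
y <- 3 + 5 * x^2 - 0.5 * x^3 + rnorm(25, sd = 4)   # simulated cubic data

fit  <- lm(y ~ x + I(x^2))        # quadratic fit: deliberately not the generating form
zhat <- residuals(fit)            # residuals z^_i
yhat <- fitted(fit)               # predicted values y^_i

mean_res <- mean(zhat)            # zero up to rounding errors
cor_res  <- cor(yhat, zhat)       # zero up to rounding errors
print(c(mean = mean_res, correlation = cor_res))
# plot(yhat, zhat); abline(h = 0)  # residual plot to be inspected by eye
```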

If the assumptions made on $f(x, \boldsymbol{\theta})$ and on the mean (zero) and variance
(constant) of the random fluctuations $z_i$ are compatible with the data, we should
therefore observe in the plot a cloud of points enclosed within a zero-centred band
of constant width.²

Fig. 11.6 Typical residual graphs after a best fit: good fit (a), wrong functional form (b) and
heteroskedasticity (see text) (c)

In particular, if the residuals are also Gaussian, the band will be roughly
symmetrical around zero. In the weighted case, the residuals for each predictor $x_i$
must be standardized (i.e. divided by the error):

\[
t_i = \frac{\hat{z}_i}{\sigma_i} = \frac{y_i - \hat{y}(x_i)}{\sigma_i} . \tag{11.93}
\]

The trend must appear random, and the values outside the band $|t_i| < 3$ must be rare,
in agreement with the $3\sigma$ law (note that this is a rough check, since the variance of $\hat{Z}_i$
is the one given in Footnote 2, with $\tilde{X}$ instead of $X$). If $\sigma_z^2$ is unknown, the residuals
must still be weighted, and in the graph, the pairs $(\hat{y}(x_i), \sqrt{w_i}\, \hat{z}_i)$ are represented.
Figure 11.6 shows three possible cases. In case (a) the fit is satisfactory; in case
(b) the points show a trend violating the $3\sigma$ law, due to a fit with an inadequate
functional form; and case (c) instead occurs when the errors of the higher values
of $y$ have been underestimated. Frequently this happens when a non-weighted fit is
performed (with absolute errors kept constant) to data that instead have a constant
relative error. The absolute error then increases with $y$, and the residual graph is as
in Fig. 11.6c; statisticians call this behaviour heteroskedasticity.

² Note that $\mathrm{Var}[\hat{Z}_i] = \sigma_z^2 (1 - l_{ii})$, where $0 \le l_{ii} \le 1$ is the element of place $ii$ of the matrix
$X (X^\dagger X)^{-1} X^\dagger$. It may happen that, for certain matrices $X$, some $l_{ii}$ are significantly different from
most of the others.

Several models can produce a visually correct residual plot. In this case, iterative
methods are used, obtained by adding or removing predictors and by comparing
pairs of consecutive models from this sequence. In the case of the polynomial of
Eq. (11.91), we could increase its degree by one unit at a time, minimize the $\chi^2$ and
decide when to stop. The method is based on the fit of the models and on the answer
to the question: “when specific predictors $f_j(x)$ are removed, is the worsening of the
fit statistically significant or not?” To respond, let us start by examining Fig. 11.7,
which shows the vectors involved in our least squares problem in the $\mathbb{R}^n$ space. To
understand its meaning, we note that the estimate of the response for each row of $X$,
that is, the vector of the quantities $\hat{y}(\boldsymbol{x}_i)$, corresponds to:

\[
\hat{\boldsymbol{y}} = X \hat{\boldsymbol{\theta}} = X (X^\dagger X)^{-1} X^\dagger \boldsymbol{y} \equiv P \boldsymbol{y} . \tag{11.94}
\]

It can be demonstrated that the $P$ matrix simply implements the orthogonal projection
of $\boldsymbol{y}$ on the vector space generated by the columns of $X$, i.e. $\hat{\boldsymbol{\theta}}$ is the parameter
vector defining the linear combination of the columns of $X$ that is closest to $\boldsymbol{y}$. Let
us now consider the $X_0$ matrix, obtained from $X$ by removing a certain number of
columns and keeping only $p_0$ predictors (hence $X_0$ has dimension $n \times (p_0 + 1)$).
The estimate of the response with this reduced model will of course be $\hat{\boldsymbol{y}}_0 = P_0 \boldsymbol{y}$,
with $P_0 = X_0 (X_0^\dagger X_0)^{-1} X_0^\dagger$, i.e. the projection matrix on the column space of $X_0$.
This is equivalent to setting $p - p_0$ parameters inside the vector $\boldsymbol{\theta}$ to zero.

By applying the Pythagorean theorem to the dashed right-angled triangle of
Fig. 11.7, we obtain an important relation involving the regression residuals of both
the full and the reduced models. Recalling that $\sum_i \hat{z}_i^2 = \|\hat{\boldsymbol{z}}\|^2 = \sum_i (y_i - \hat{y}_i)^2$
and $\|\hat{\boldsymbol{z}}_0\|^2 = \sum_i (y_i - \hat{y}_0(x_i))^2$, we have:

\[
\|\hat{\boldsymbol{z}}_0\|^2 = \|\hat{\boldsymbol{z}} - \hat{\boldsymbol{z}}_0\|^2 + \|\hat{\boldsymbol{z}}\|^2 . \tag{11.95}
\]

Fig. 11.7 Vector representation in the $\mathbb{R}^n$ space. Regression with all predictors ($X$ matrix) and
with a subset of predictors ($X_0$ matrix). The response estimate $\hat{y}(X) = \hat{\boldsymbol{y}}$ at each row of $X$ is the
orthogonal projection of $\boldsymbol{y}$ on the linear space generated from $X$ columns. A similar argument
applies to $\hat{\boldsymbol{y}}_0$. It is therefore immediate to derive the vectors of the residuals of the two
regressions and the relation between them


If $y(x)$ is given by the first of Eq. (11.91) and the functions $f_k(x)$ satisfy the
orthogonality property (11.68), by explicitly writing the residuals, it can easily be
shown that Eq. (11.95) is equivalent to the condition:

\[
\sum_i [y_i - \hat{y}(x_i)]\,[\hat{y}(x_i) - \hat{y}_0(x_i)] = \sum_{k=p_0+1}^{p} \hat{\theta}_k \sum_i [y_i - \hat{y}(x_i)]\, f_k(x_i) = 0 , \tag{11.96}
\]

which can be verified with the normal equations (11.58). From Eq. (11.36) we can
easily deduce that the elements of the vectors of residuals are linear combinations
of the responses $y_i$. If the response is Gaussian, the two norms at the second
member are relative to orthogonal (and then uncorrelated) vectors, which are also
independent, being linear combinations of Gaussian variables. Moreover, if the
reduced model with $p_0$ predictors holds, from Cochran’s Theorem 4.5, we have:

\[
\frac{1}{\sigma_z^2} \|\hat{\boldsymbol{z}} - \hat{\boldsymbol{z}}_0\|^2 \sim \chi^2(p - p_0) , \qquad
\frac{1}{\sigma_z^2} \|\hat{\boldsymbol{z}}\|^2 \sim \chi^2(n - p - 1) . \tag{11.97}
\]

These results on the independence and distribution of squared norms give us the
answer to the question we asked ourselves a little while ago since, if the reduced
model is valid, then:

\[
F = \frac{\|\hat{\boldsymbol{z}} - \hat{\boldsymbol{z}}_0\|^2 / (p - p_0)}{\|\hat{\boldsymbol{z}}\|^2 / (n - p - 1)}
= \frac{(\|\hat{\boldsymbol{z}}_0\|^2 - \|\hat{\boldsymbol{z}}\|^2) / (p - p_0)}{\|\hat{\boldsymbol{z}}\|^2 / (n - p - 1)} \tag{11.98}
\]

follows the Snedecor $F$ density with $(p - p_0, n - p - 1)$ degrees of freedom.
This density has been defined by Eq. (5.46) as the distribution of the ratio between
two independent reduced $\chi^2$ variables and has been used in Exercise 7.11. Often
Eq. (11.98) is written with a different notation:

\[
F = \frac{(RSS_0 - RSS)/(p - p_0)}{RSS/(n - p - 1)} , \tag{11.99}
\]

with $RSS_0 = \|\hat{\boldsymbol{z}}_0\|^2$ and $RSS = \|\hat{\boldsymbol{z}}\|^2$, where RSS is the acronym of residual sum of
squares. The statistic (11.99) quantitatively measures the worsening of the reduced
model fit with respect to the full model by means of the RSS increase.

We note that the $F$ test, being a ratio between $\chi^2$ variables, can be performed
also when only the relative errors are known. A significantly large value of $F$
indicates that the predictors removed from the complete model were important. For
this reason we reject at the $\alpha$ level the hypothesis that the corresponding coefficients
are null if:

\[
F > F_{1-\alpha}(p - p_0, n - p - 1) . \tag{11.100}
\]
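The test of Eqs. (11.99) and (11.100) can be reproduced in R both by hand and with the standard anova function. The sketch below uses simulated unweighted data and models chosen by us for illustration only; it also checks numerically the Pythagorean relation (11.95).

```r
# Sketch: F test between a full and a reduced (nested) model, Eqs. (11.95)-(11.100).
set.seed(6)
n <- 30
x <- seq(0, 3, length.out = n)
y <- 1 + 2 * x + 0.8 * x^2 + rnorm(n, sd = 0.5)   # simulated quadratic data

full    <- lm(y ~ x + I(x^2) + I(x^3))            # p  = 3 predictors
reduced <- lm(y ~ x)                              # p0 = 1 predictor
p  <- 3
p0 <- 1

RSS  <- sum(residuals(full)^2)                    # ||z^||^2
RSS0 <- sum(residuals(reduced)^2)                 # ||z^_0||^2
pyth <- sum((fitted(full) - fitted(reduced))^2)   # ||z^ - z^_0||^2, Eq. (11.95)

F_stat <- ((RSS0 - RSS) / (p - p0)) / (RSS / (n - p - 1))   # Eq. (11.99)
F_crit <- qf(0.95, p - p0, n - p - 1)                        # Eq. (11.100), alpha = 0.05
p_val  <- pf(F_stat, p - p0, n - p - 1, lower.tail = FALSE)

tab <- anova(reduced, full)                       # the same test from R
print(c(F = F_stat, F_critical = F_crit, p_value = p_val))
```

The removed predictors are judged important when F_stat exceeds F_crit or, equivalently, when p_val is below the chosen level.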


A particular case is when the reduced model is obtained by removing only a single
predictor, for instance, by setting $\theta_p = 0$ into one of Eq. (11.91), so that $p_0 = p - 1$.
The $F$ variable, under $H_0 : \theta_p = 0$, has an $F(1, n - p - 1)$ distribution, and it is
easy to recognize that the equality $F_{1-\alpha}(1, n - p - 1) = t^2_{1-\alpha/2}(n - p - 1)$ holds,
where $t$ is Student’s $t$ quantile with $(n - p - 1)$ degrees of freedom. Indeed, $F$ is
also the square of:

\[
T = \frac{\hat{\theta}_p}{s(\hat{\theta}_p)} , \tag{11.101}
\]

where $s(\hat{\theta}_p)$ is given by Eq. (11.63) and $T$ has Student’s $t$-distribution with $(n - p - 1)$
degrees of freedom.
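Both equalities can be verified directly in R. The short sketch below uses simulated data and names of our own choice: it compares the quantiles of the two distributions and the square of the $T$ value printed by summary with the $F$ value of the corresponding nested test.

```r
# Sketch: removing a single predictor, F = T^2 and F_{1-alpha}(1, nu) = t^2_{1-alpha/2}(nu).
set.seed(7)
n <- 25
x <- seq(0, 2, length.out = n)
y <- 1 + x + 0.5 * x^2 + rnorm(n, sd = 0.3)       # simulated data

full    <- lm(y ~ x + I(x^2))                     # p = 2
reduced <- lm(y ~ x)                              # p0 = p - 1
nu      <- n - 2 - 1                              # degrees of freedom n - p - 1

T_stat <- summary(full)$coefficients["I(x^2)", "t value"]   # Eq. (11.101)
F_stat <- anova(reduced, full)$F[2]                         # Eq. (11.99)

q_F <- qf(0.95, 1, nu)                            # F quantile, alpha = 0.05
q_t <- qt(0.975, nu)                              # Student's quantile
print(c(T2 = T_stat^2, F = F_stat, qF = q_F, qt2 = q_t^2))
```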

All this can be verified by reading the discussion about the confidence intervals
for the least squares line parameters of Eq. (11.40), where $p = 1$, $a = \theta_0$ and $b = \theta_1$.
Having this in mind, we will reject $H_0$ and accept the new $(p_0 + 1)$th parameter if
$|T|$ exceeds $t_{1-\alpha/2}(n - p - 1)$. In other words, a large value of $|T|$ indicates a small
relative error and therefore the importance of the parameter. Recalling the problem
of choosing the degree of the polynomial in Eq. (11.91), we then could, for example,
start from the model with only $\theta_0$ and verify the hypothesis $H_0 : \theta_1 = 0$. If this is
rejected, we will add the second degree term and test the hypothesis $H_0 : \theta_2 = 0$ and
so on. If $k$ is the index of the first test that does not reject $H_0$, we will set $p = k - 1$.
This is just one of the possible procedures. If by chance the second degree term is
not significant, but the third degree term is significant, with this procedure we would
not notice it.

It is important to point out that everything we have described so far provides necessary
but not sufficient conditions to ensure that the functional form has been correctly
found. In other words, there are many functional forms (not just the “true” one) that
satisfy the best fit criteria listed above. This is a basic ambiguity, which cannot be resolved
by statistics alone and which one must always be aware of. To reduce this uncertainty, it is
therefore essential to introduce, inside the model, all the a priori known information
about the problem and try to reduce errors as much as possible.

We exemplify these concepts by simulating a realistic case. Starting from the
polynomial:

\[
y = \theta_0 + \theta_2 x^2 + \theta_3 x^3 = 3 + 5x^2 - 0.5x^3 , \tag{11.102}
\]

and applying the techniques of Chap. 8, we generated simulated (artificial) Gaussian
data having a rather large relative standard deviation of ±15%:

\[
y_i = (1 + 0.15\, g_i)(3 + 5x_i^2 - 0.5x_i^3) ,
\]

where $g_i$ is a standard normal variate. The data are reported in Eq. (11.103):

x    1.0    2.0    3.0    4.0    5.0    6.0    7.0    8.0
y    6.9   22.4   40.8   64.4   60.0   81.0   78.0   70.5        (11.103)
σ    1.1    2.9    5.2    7.7    9.8   11.3   11.5   10.1

Table 11.2 Results of the best fits to the data of Eq. (11.103) with polynomials of the type $y(x) =
\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$ assuming known relative errors. The estimate $s_z$ is given by Eq. (11.79),
whereas RSS is the weighted sum of squared residuals, that is, the numerator of Eq. (11.79)

        FIT1           FIT2           FIT3           FIT4           FIT5
θ0      −5.1 ± 2.7     −15.1 ± 3.0    −6.1 ± 5.1     -              1.8 ± 1.1
θ1      12.8 ± 1.4     23.2 ± 2.8     10.6 ± 6.7     2.7 ± 1.4      -
θ2      -              −1.5 ± 0.4     2.8 ± 2.1      5.1 ± 0.8      6.0 ± 0.5
θ3      -              -              −0.4 ± 0.2     −0.6 ± 0.1     −0.6 ± 0.1
RSS     0.380          0.097          0.049          0.066          0.079
s_z     0.252          0.139          0.111          0.115          0.126


where $\sigma = \sigma_z/\sqrt{w} = 0.15/\sqrt{w}$ and $1/\sqrt{w} = 3 + 5x^2 - 0.5x^3$. The
coefficients of the different regression polynomials obtained from the fits of these
data are shown in Table 11.2, together with the values of RSS. They have been
calculated with our code FitPolin, which uses the Linfit routine. By storing
the vectors x, y of Eq. (11.103) and the weights w in x, y, w, the instructions for
the polynomial FIT3 of Table 11.2 are the following:³


>class(fitfun <- y ~ x + I(x^2) + I(x^3))
>FitPolin(x=x, y=y, dy=1/sqrt(w), fitfun=fitfun, ww='REL')


With the options ww='REL' and dy=1/sqrt(w), we assume for the moment
that only the relative weights $w$ are known, while the overall scale factor $\sigma_z$ is not.
Therefore the confidence intervals of the parameters are given by Eq. (11.64) with
the error given by Eq. (11.81). However, it should be recalled that the errors on the
parameters and the estimate $s_z$ of $\sigma_z$ are reliable only if we use a functional form of
the model not far from the true model, so that
$s_z\sqrt{(X^\dagger W X)^{-1}} \simeq \sigma_z\sqrt{(X^\dagger W X)^{-1}}$.


Moreover, as we will briefly explain in Sect. 11.10, the $\chi^2$ test must not be
performed. We can only compare the fit results between two models using the $F$
test of Eq. (11.99), where the weighted sum of squares of the residuals must be
used: $RSS = (y - X\hat\theta)^\dagger W (y - X\hat\theta)$. This comparison is also useful to discard
over-parameterized models.
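
For readers who do not have the FitPolin routine at hand, the weighted fits can be
sketched with base R. This is only an illustration under stated assumptions: `lm()` with
`weights = w` is used as a stand-in for FitPolin (whose internals are not shown in this
section), and the helper name `summarise_fit` is hypothetical. The block computes the
weighted RSS and the estimate $s_z = \sqrt{RSS/(n-p-1)}$ for the first three polynomials.

```r
# Data of Eq. (11.103)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(6.9, 22.4, 40.8, 64.4, 60.0, 81.0, 78.0, 70.5)

# Relative weights, from 1/sqrt(w) = 3 + 5 x^2 - 0.5 x^3
w <- 1 / (3 + 5 * x^2 - 0.5 * x^3)^2

# Base-R stand-ins for FIT1, FIT2 and FIT3 (assumption: lm() replaces FitPolin)
models <- list(
  FIT1 = y ~ x,
  FIT2 = y ~ x + I(x^2),
  FIT3 = y ~ x + I(x^2) + I(x^3)
)

# Hypothetical helper: weighted RSS, scale estimate s_z and degrees of freedom
summarise_fit <- function(f) {
  fit <- lm(f, weights = w)
  rss <- sum(w * residuals(fit)^2)   # weighted sum of squared residuals
  dof <- df.residual(fit)            # n - p - 1
  c(RSS = rss, s_z = sqrt(rss / dof), dof = dof)
}

results <- sapply(models, summarise_fit)
print(round(results, 3))
```

If the data and weights are transcribed as above, the RSS and $s_z$ rows can be compared
with the corresponding rows of Table 11.2; small differences may arise from the rounding
of the tabulated data.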

In our example, the fit with the straight-line model: 


$$y(x) = \theta_0 + \theta_1 x ,$$


3 The use of objects of the formula class, such as fitfun, is described in the R online manual.




Fig. 11.8 Residuals (11.92) of the regression for (a) FIT1 and (b) FIT5. Compare these
values with those shown in Fig. 11.6


denoted as FIT1 in Table 11.2, fails the check with the residual plot: indeed, 
Fig. 11.8 has a behaviour similar to that of Fig. 11.6b. 
We then consider a quadratic polynomial: 


$$y(x) = \theta_0 + \theta_1 x + \theta_2 x^2 ,$$

obtaining the FIT2 result. The F test ratio (11.98) between FIT2 and FIT1 gives:


$$F(1,5) = \frac{0.380 - 0.097}{0.097}\, 5 = 14.6 .$$


From Table E.5 at 5% and 1% levels, one obtains:

$$14.6 > F_{0.95}(1,5) = 6.61 , \qquad 14.6 < F_{0.99}(1,5) = 16.3 .$$


The test shows that the null hypothesis (i.e. the uselessness of FIT2) can be rejected 
at the 5% level, but not at the 1% level. If we consider that the solution is very close 
to the 1% level, it is legitimate to consider the three-parameter solution of FIT2 to 
be significant. 
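
The numbers of this F test can be checked in a few lines of R. The sketch below takes
the weighted RSS values from Table 11.2 and uses the standard functions `qf()` and `pf()`
in place of Table E.5; the helper name `f_ratio` is hypothetical.

```r
# Weighted RSS values taken from Table 11.2
rss_fit1 <- 0.380   # straight line, 2 parameters
rss_fit2 <- 0.097   # parabola, 3 parameters
n <- 8              # number of data points in Eq. (11.103)

# Hypothetical helper: F ratio for 'extra' added parameters,
# with 'dof1' degrees of freedom left in the larger model
f_ratio <- function(rss0, rss1, extra, dof1) {
  ((rss0 - rss1) / extra) / (rss1 / dof1)
}

F21     <- f_ratio(rss_fit1, rss_fit2, extra = 1, dof1 = n - 3)
crit    <- qf(c(0.95, 0.99), df1 = 1, df2 = n - 3)   # 5% and 1% critical values
p_value <- pf(F21, df1 = 1, df2 = n - 3, lower.tail = FALSE)

print(round(c(F = F21, F95 = crit[1], F99 = crit[2], p = p_value), 3))
```

The observed ratio falls between the two critical values, so the p-value lies between
1% and 5%, in agreement with the discussion above.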

We could stop at this point, but let us see what happens with FIT3, in which a 
cubic polynomial with four parameters is used: 


$$y(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 .$$




The F test now gives $3.9 < F_{0.99}(1, 4) = 21.20$, showing that the fourth parameter is
useless. Moreover, all four parameters of FIT3 have a small $T$ value in Eq. (11.101)
and are therefore compatible with zero. All this shows that four free parameters
are redundant. In FIT4 and FIT5 the fit is attempted with cubic curves but with
the suppression of the parameters $\theta_0$ and $\theta_1$, respectively. The F test
comparing FIT4 and FIT5 with FIT3 again indicates that the four free
parameters of FIT3 are too many compared to the three of FIT4 and FIT5, and
that the resulting fits are practically equivalent to FIT2. Finally, in the last row of
Table 11.2, the different estimates $s_z$ of the common multiplicative scale error $\sigma_z$ are
reported. They have been obtained from the FitPolin routine using Eq. (11.79).
Apart from the clearly wrong case of FIT1, all the other fits give similar values.

Let us now consider the case of known absolute errors $\sigma_i$. Our code FitPolin,
called with the options dy=sy and ww='ABS', where sy is the vector containing
the last row of Eq. (11.103), gives the output results reported in Table 11.3. The
parameter errors, given now by Eq. (11.80), are different, and the final value of
$\chi^2/\nu$ can also be used to perform the $\chi^2$ test. The observed significance level SL,
reported in the last line of Table 11.3, confirms the previous conclusions.
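
As a cross-check, the $\chi^2/\nu$ and SL rows of Table 11.3 can be approximately recovered
from the weighted RSS values of Table 11.2, since with known absolute errors
$\chi^2 = RSS/\sigma_z^2$ with $\sigma_z = 0.15$. The sketch below assumes that SL is the two-tailed
observed significance level, $2\min(P, 1-P)$; this convention is an assumption that appears
consistent with the tabulated values, not something stated in this section.

```r
# Weighted RSS from Table 11.2 and degrees of freedom nu = n - p - 1 (n = 8)
rss <- c(FIT1 = 0.380, FIT2 = 0.097, FIT3 = 0.049, FIT4 = 0.066, FIT5 = 0.079)
nu  <- c(FIT1 = 6,     FIT2 = 5,     FIT3 = 4,     FIT4 = 5,     FIT5 = 5)

sigma_z <- 0.15                   # known scale of the relative errors
chi2    <- rss / sigma_z^2        # chi-square with absolute errors

# Assumed convention: two-tailed observed significance level
p_low <- pchisq(chi2, df = nu)
sl    <- 2 * pmin(p_low, 1 - p_low)

print(round(rbind(chi2_over_nu = chi2 / nu, SL_percent = 100 * sl), 1))
```

Because the RSS values are rounded to three decimals, the significance levels obtained
in this way agree with the table only to within a few percentage points.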

The results FIT2, FIT4 and FIT5 of Table 11.2 are then all equally compatible,
although we know (but only because the data have been simulated by Eq. (11.102))
that the solution closest to the true one is given by FIT5.

The correct conclusion is therefore the following: we have shown that, starting
from a polynomial of first degree and progressively adding higher-degree terms,
the eight values of Eq. (11.103), affected by a sizeable ±15% relative error, are
compatible with a quadratic or cubic functional form having no more than three
parameters. This class of solutions includes the true function (11.102). The use of
orthogonal functions with the property (11.68) greatly simplifies the minimization
procedure because, due to their independence, the addition of new parameters
does not change the fitted values of the previous ones. However, this procedure
generally does not help resolve ambiguities. The situation is effectively summarized
in Fig. 11.9, where we immediately see that, in the case of absolute errors, second-
and third-degree polynomials are statistically compatible with the data within the
statistical fluctuations.
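
The remark on orthogonal functions can be illustrated with the `poly()` function of base R,
which builds polynomials that are orthogonal over the sample points. The sketch below is
restricted to an unweighted fit (an assumption made for simplicity, since `poly()` is
orthogonal with respect to the unweighted sum) and uses the $x$, $y$ values of Eq. (11.103).

```r
# x and y of Eq. (11.103); the fit is unweighted for illustration only
x <- 1:8
y <- c(6.9, 22.4, 40.8, 64.4, 60.0, 81.0, 78.0, 70.5)

# Ordinary powers: adding x^2 changes the previous coefficients
raw1 <- coef(lm(y ~ x))
raw2 <- coef(lm(y ~ x + I(x^2)))

# Orthogonal polynomials: the previous coefficients stay the same
ort1 <- coef(lm(y ~ poly(x, 1)))
ort2 <- coef(lm(y ~ poly(x, 2)))

print(rbind(raw_line = raw1, raw_parabola = raw2[1:2]))
print(rbind(ort_line = ort1, ort_parabola = ort2[1:2]))
```

In the orthogonal basis the first two coefficients are identical in the two fits, whereas
with ordinary powers they change when the quadratic term is added.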


Table 11.3 Best-fit results with polynomials of the type
$y(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$ for the
data of Eq. (11.103) for the case of known absolute errors

              FIT1          FIT2          FIT3          FIT4          FIT5
$\theta_0$   -5.1 ± 1.6   -15.1 ± 3.2    -6.1 ± 7.0        -         1.8 ± 1.3
$\theta_1$   12.8 ± 0.9    23.2 ± 3.1    10.6 ± 9.1    2.7 ± 1.8         -
$\theta_2$       -         -1.5 ± 0.4     2.8 ± 2.9    5.1 ± 1.0     6.0 ± 0.6
$\theta_3$       -             -         -0.4 ± 0.2   -0.6 ± 0.1    -0.6 ± 0.1
$\chi^2/\nu$   2.8           0.9           0.5          0.6           0.7
SL              2%           98%           60%          58%           72%




Fig. 11.9 Experimental points from Eq. (11.103) and results of the polynomial fits given in
Table 11.2. Dashed line, FIT1; dotted line, FIT2; point-dashed line, FIT4; and full line, FIT5. Apart
from FIT1, all the other polynomials have three free parameters


The figure also shows that the extrapolation of the curves outside the data range
is extremely dangerous: the correct extrapolated value, for x = 10, is that of FIT5,
which is the full curve; completely different results are obtained with the other
curves, which nevertheless represent the data well within the measurement interval.
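
The danger of extrapolation can be made quantitative with a short sketch. As before,
`lm()` with weights is used as an assumed stand-in for FitPolin, and the helper name
`predict_at` is hypothetical. The three-parameter models are evaluated once inside the
data range (x = 4) and once outside it (x = 10), and compared with the true function (11.102).

```r
# Data and relative weights of Eq. (11.103)
x <- 1:8
y <- c(6.9, 22.4, 40.8, 64.4, 60.0, 81.0, 78.0, 70.5)
w <- 1 / (3 + 5 * x^2 - 0.5 * x^3)^2

# Three-parameter models (lm() used here in place of FitPolin)
fits <- list(
  FIT2 = lm(y ~ x + I(x^2), weights = w),
  FIT4 = lm(y ~ 0 + x + I(x^2) + I(x^3), weights = w),  # theta_0 suppressed
  FIT5 = lm(y ~ I(x^2) + I(x^3), weights = w)           # theta_1 suppressed
)
truth <- function(x) 3 + 5 * x^2 - 0.5 * x^3            # Eq. (11.102)

# Hypothetical helper: predictions of all models at a single point x0
predict_at <- function(x0) {
  sapply(fits, function(f) unname(predict(f, newdata = data.frame(x = x0))))
}

inside  <- predict_at(4)    # within the measured interval
outside <- predict_at(10)   # extrapolation

print(round(rbind(x_4 = c(inside, true = truth(4)),
                  x_10 = c(outside, true = truth(10))), 1))

spread_inside  <- diff(range(inside))
spread_outside <- diff(range(outside))
```

The spread among the models at x = 10 is much larger than inside the data range, even
though all three describe the measured points comparably well.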


11.9 Search for Correlations 


The search for functional forms, which we have just described, mainly investigates
the analytical relation (11.3) between x and y described in Sect. 11.2. On the
contrary, when X is also a random variable, as in the case of Eq. (11.5), and
we examine the link (11.11) between $f(X)$ and the correlation coefficient, the
problem is usually called search for correlations. Now, according to Eqs. (11.6–11.7),
the variance in the denominator of the $\chi^2$ variable is a constant representing
the response fluctuations. We are therefore in the unweighted fit case.

Suppose we have n occurrences of the pairs of variables (X, Y). If we denote by:


$$\hat y_i \equiv \hat y(x_i) = f(x_i, \hat\theta) \tag{11.104}$$


the estimate of the Y mean values at any given x, the decomposition of Eq. (11.95) 
becomes in this case a decomposition of the sample deviance of Y. Therefore, 




Eq. (2.40) can be written as: 


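$$\underbrace{\sum_{i=1}^n (y_i - \langle y\rangle)^2}_{(RSS_0)} =
\underbrace{\sum_{i=1}^n (\hat y_i - \langle y\rangle)^2}_{(RSS_0 - RSS)} +
\underbrace{\sum_{i=1}^n (y_i - \hat y_i)^2}_{(RSS)} . \tag{11.105}$$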


In words, this decomposition corresponds to: 


total sum of squares = explained sum of squares + residual sum of squares .     (11.106)


This equality is the simplest form of analysis of variance, which we have extensively
described in Sect. 7.9, and has a very interesting interpretation: the dispersion
of $y_i$ around $\langle y\rangle$ is decomposed into two uncorrelated parts: the first is the
one identified by the linear regression model with $p$ predictors (i.e. by $f(X)$ in
Eq. (11.5)), and the second is the residual, that is, the sum of squares of
the deviations, denoted by $Z$ in Eq. (11.5), that the model cannot explain. If the
model interpolated the data, we would have $y_i - \hat y_i = 0$ for each $i$, a statistically
unacceptable result.

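From Eq. (11.60) we also deduce $\hat\theta_0 = \langle y\rangle = \hat y_{0i}$ and hence
$\hat z_{0i} = y_i - \langle y\rangle$. Therefore:


$$\|\hat z_0\|^2 = \sum_i (y_i - \langle y\rangle)^2 = (n-1)\, s_y^2 .$$


Moreover, from Fig. 11.7, we see that $\|\hat z - \hat z_0\| = \|\hat y - \hat y_0\|$. By dividing the explained
sum of squares of Eq. (11.106) by the total sum of squares, we obtain the coefficient
of determination:


$$R^2 = \frac{\sum_{i=1}^n (\hat y_i - \langle y\rangle)^2}{\sum_{i=1}^n (y_i - \langle y\rangle)^2}
= \frac{\|\hat y - \hat y_0\|^2}{\|\hat z_0\|^2}
= \frac{\|\hat y - \hat y_0\|^2}{(n-1)\, s_y^2} . \tag{11.107}$$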


Recalling Eq. (11.105), we see that $0 \le R^2 \le 1$. The upper limit is reached when
the model interpolates the data, while $R^2 = 0$ only if
$\hat\theta_1 = \hat\theta_2 = \dots = \hat\theta_p = 0$, a
condition that occurs if $y$ does not depend linearly on any of the model predictors,
so that it has “residual” fluctuations only. In practice, the zero value is never reached
exactly, but sometimes very small values can occur.

From Fig. 11.7, it can be shown that $R^2$ is the square of the sample correlation
coefficient between $y_i$ and $\hat y_i$. This parameter is called the multiple correlation coefficient
and can be interpreted as a measure of the sample correlation between $y_i$ and the
rows $x_i$ of the $X$ matrix, since each $\hat y_i$ is the prediction of $Y$ at $x_i$. In the specific case
of the least squares line (when $p = 1$), substituting Eq. (11.29) into the numerator
of Eq. (11.107), one obtains the relation:


$$\sum_{i=1}^n (\hat y_i - \langle y\rangle)^2 = r^2\, \frac{s_y^2}{s_x^2} \sum_{i=1}^n (x_i - \langle x\rangle)^2
= r^2\, (n-1)\, s_y^2 ,$$

which shows that $R^2 = r^2$ is the square of the linear correlation coefficient between
the values $x_i$ and $y_i$.
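
These identities are easy to verify numerically. The following sketch uses synthetic
data (an assumption made purely for illustration, not data from the text) and a least
squares line fitted with `lm()`; it checks the decomposition (11.105), the definition
(11.107) and the equality $R^2 = r^2$.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility
x <- seq(1, 10, length.out = 20)
y <- 3 + 2 * x + rnorm(length(x), sd = 2)    # synthetic data (assumption)

fit  <- lm(y ~ x)                            # least squares line, p = 1
yhat <- fitted(fit)
n    <- length(y)

tss <- sum((y - mean(y))^2)      # total sum of squares, RSS_0
ess <- sum((yhat - mean(y))^2)   # explained sum of squares, RSS_0 - RSS
rss <- sum((y - yhat)^2)         # residual sum of squares, RSS

R2 <- ess / tss                  # coefficient of determination, Eq. (11.107)

print(c(gap           = tss - (ess + rss),   # should vanish up to rounding
        R2            = R2,
        cor_y_yhat_sq = cor(y, yhat)^2,
        r_sq          = cor(x, y)^2))
```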

To complete the analysis of the decomposition (11.105), we mention the 
coefficient of determination corrected for degrees of freedom, which is given in 
the output of many least squares estimation codes, as in the case of our Linfit 
routine. It is defined by: 


$$\bar R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat y_i)^2/(n-p-1)}{\sum_{i=1}^n (y_i - \langle y\rangle)^2/(n-1)}
= 1 - \frac{RSS/(n-p-1)}{RSS_0/(n-1)} .$$

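
As an illustration, the corrected coefficient can be computed by hand and compared with
the value reported by `summary()` for an `lm` object. The data below are synthetic and the
helper name `adj_r2` is hypothetical.

```r
# Hypothetical helper: coefficient of determination corrected for degrees of freedom
adj_r2 <- function(fit) {
  y    <- model.response(model.frame(fit))
  n    <- length(y)
  dof  <- df.residual(fit)            # n - p - 1
  rss  <- sum(residuals(fit)^2)       # RSS
  rss0 <- sum((y - mean(y))^2)        # RSS_0
  1 - (rss / dof) / (rss0 / (n - 1))
}

set.seed(7)                                        # arbitrary seed
x <- seq(0, 5, length.out = 15)
y <- 1 + 0.5 * x^2 + rnorm(length(x), sd = 1)      # synthetic data (assumption)

fit <- lm(y ~ x + I(x^2))
print(c(by_hand = adj_r2(fit), from_summary = summary(fit)$adj.r.squared))
```
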
The search for correlations aims to find the function $f(x)$ maximizing $R^2$, that is, the
function that maximizes the explained sum of squares and minimizes the residual
one. We outline the procedure for functional forms of the type (11.91), keeping in
mind that it still holds for any multiple linear regression model:


(a) Given a function $f(x, \theta)$ which is linear in the parameters, the LS estimate $\hat\theta$ is
evaluated by minimizing the quantity:


$$\sigma_z^2\, \chi^2(\theta) = \sum_i [y_i - f(x_i, \theta)]^2 . \tag{11.108}$$


(b) Once the estimate has been obtained, the explained sum of squares is calculated
together with the coefficient of determination $R^2$.

(c) Among all the functional forms hypothesized in point (a), the one that has the
maximum $R^2$ is selected. If there are two or more almost equivalent solutions,
the one having the smaller number of parameters is chosen (the result does not
change if $RSS = \sigma_z^2 \chi^2(\hat\theta)$ is minimized instead of maximizing $R^2$).

(d) The trend of the residual plot of the chosen solution is checked; a random
behaviour indicates the absence of structures clearly due to functional
dependencies not captured by $f(x, \hat\theta)$.


Let us apply these rules to a data set simulated with the algorithm: 


$$\begin{aligned}
x &= x_0 + x_R = 10 + 2\, g_1 \\
f(x) &= 2 + x^2 \\
y &= f(x) + Z = 2 + x^2 + 5\, g_2 ,
\end{aligned} \tag{11.109}$$

where $g_1$ and $g_2$ are standard Gaussian variates. $X$ is then a normal variable with
$\mu_x = 10$, $\sigma_x = 2$, whereas $Y$ is not normal because the correlation $f(X)$ is nonlinear;
however, the fluctuations around the correlation function are Gaussian with $\sigma_z = 5$.
Twenty $(x_i, y_i)$ pairs obtained using the previous algorithm are reported in the
following table (11.110):


x    6.28   6.62   7.10   7.46   7.54   8.11   8.62   8.95   9.00   9.62
y    40.2   47.3   47.4   48.7   52.2   68.3   79.8   88.3   86.4   97.7

x    9.92  10.10  10.11  10.25  10.34  11.09  12.23  12.46  13.31  13.82
y    98.7  102.2  101.7  111.2  105.4  118.5  154.3  164.0  182.5  191.1

(11.110)
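
A data set of this kind can be generated with a few lines of R. The sketch below follows
the algorithm (11.109) literally; the seed is arbitrary (an assumption), so the numbers
produced will not coincide with those of table (11.110).

```r
set.seed(123)            # arbitrary seed: the original one is not known
n  <- 20
g1 <- rnorm(n)           # standard Gaussian variates
g2 <- rnorm(n)

x <- 10 + 2 * g1         # X ~ N(10, 2^2)
y <- 2 + x^2 + 5 * g2    # parabolic correlation plus Gaussian fluctuations

# Rounded and sorted by x, as in table (11.110)
sim <- data.frame(x = round(x, 2), y = round(y, 1))
sim <- sim[order(sim$x), ]
print(sim, row.names = FALSE)
```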


The results obtained with our routine FitPolin by minimizing the $\chi^2$:


$$\chi^2 = \sum_i (y_i - \theta_0 - \theta_1 x_i - \theta_2 x_i^2 - \theta_3 x_i^3)^2$$


and considering polynomials up to the third degree are reported in Table 11.4 and
Fig. 11.10. The parameter errors are calculated with the Linfit routine using
Eq. (11.79).

As an example, assuming that the vectors x and y of Eq. (11.110) have been loaded into
xx, yy, the code instructions for the FIT2 polynomial of Table 11.4 are:


>class(fitfun <- yy ~ x + I(x^2))
>FitPolin(x=xx, y=yy, fitfun=fitfun, ww='NO')


where the parameter ww specifies that the errors on the variables are not given. If we
now apply rules (a)–(d) just discussed, FIT3 appears as the best solution:


$$f(x) = -1.02 + 1.03\, x^2 .$$


This solution maximizes the explained variance to 99.1% and minimizes the
Gaussian fluctuations around the regression curve to $s_z = 4.4$. The FIT2 solution
gives the same results, but with one additional parameter and parameter errors that
are too large (i.e. with values of the $T$ statistic of Eq. (11.101) that are too small).
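
The rules (a)–(d) can be sketched with base R on the data of table (11.110). The block
below is an illustration, not the FitPolin code: `lm()` is used for the unweighted fits,
the helper name `report` is hypothetical, and the assignment of the straight line to FIT1
is an assumption. FIT2 and FIT3 are the full parabola and the parabola without linear
term, as described in the text.

```r
# Data of table (11.110)
xx <- c(6.28, 6.62, 7.10, 7.46, 7.54, 8.11, 8.62, 8.95, 9.00, 9.62,
        9.92, 10.10, 10.11, 10.25, 10.34, 11.09, 12.23, 12.46, 13.31, 13.82)
yy <- c(40.2, 47.3, 47.4, 48.7, 52.2, 68.3, 79.8, 88.3, 86.4, 97.7,
        98.7, 102.2, 101.7, 111.2, 105.4, 118.5, 154.3, 164.0, 182.5, 191.1)

# Rule (a): candidate functional forms, all linear in the parameters
candidates <- list(
  FIT1 = yy ~ xx,            # straight line (assumed to correspond to FIT1)
  FIT2 = yy ~ xx + I(xx^2),  # full parabola
  FIT3 = yy ~ I(xx^2)        # theta_0 + theta_2 x^2
)

# Rule (b): R^2 and the estimate s_z for each candidate (hypothetical helper)
report <- function(f) {
  s <- summary(lm(f))
  c(n_par = nrow(s$coefficients), R2 = s$r.squared, s_z = s$sigma)
}
tab <- sapply(candidates, report)
print(round(tab, 3))

# Rule (c): among nearly equivalent solutions, prefer the one with fewer parameters
best <- lm(candidates$FIT3)
print(round(summary(best)$coefficients, 2))

# Rule (d): residuals of the chosen solution, to be inspected for trends
res <- residuals(best)
```

With the data as transcribed here, the FIT3 coefficients, $R^2$ and $s_z$ should come out
close to the values quoted in the text (about −1.02 and 1.03, 99.1% and 4.4).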

The most reasonable conclusion is therefore that the data have a parabolic
correlation function $f(x) \simeq x^2$, with a small or even negligible constant term


Table 11.4 Best-fit results with polynomials of the type
$y(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$ for the
data of Eq. (11.110)




Fig. 11.10 (a) Experimental points of Eq. (11.110) and fitted polynomials from
Table 11.4. Dashed line, FIT1; full line, FIT3; dotted line, FIT4. The FIT2 solution
gives graphically indistinguishable results from those of FIT3. (b) Residuals
(11.92) corresponding to the three solutions of Fig. (a)


compared to the mean of the values of $y$, since the error on the parameter is large
($\theta_0 \in -1.02 \pm 2.45$). The absolute fluctuations around this curve are $\simeq 4.4$ (in $s_z$
units). As you can see, this conclusion is quite close to the “truth” represented by
Eq. (11.109). The residual plot, shown in Fig. 11.10b, also demonstrates that the
FIT3 solution has more regular residual fluctuations than the others.

Finally, we note that if we progressively increased the number of parameters, at
some point we would certainly find even larger values of $R^2$ and even smaller values
of $s_z$. In the most extreme situation, with 20 parameters, we would go through all
the points exactly, getting $R^2 = 1$ and $s_z = 0$! However, these correlation studies
must be performed with functions far from the interpolation limit and for which
the fluctuations of the data appear as random. The functional forms to be
tested, such as the maximum degree of polynomials, must therefore be determined
a priori on the basis of the available information about the problem, to reach a
compromise between the best possible fit and a model with the minimum number
of free parameters.
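
The growth of $R^2$ with the number of parameters can be seen in a short numerical
experiment. The sketch below generates synthetic data with the algorithm (11.109)
(arbitrary seed, so the values differ from table (11.110)) and scans the polynomial
degree using `poly()` inside `lm()`.

```r
set.seed(11)                      # arbitrary seed (assumption)
x <- 10 + 2 * rnorm(20)
y <- 2 + x^2 + 5 * rnorm(20)      # algorithm of Eq. (11.109)

degrees <- 1:8
scan <- t(sapply(degrees, function(d) {
  s <- summary(lm(y ~ poly(x, d)))          # polynomial of degree d
  c(degree = d, R2 = s$r.squared, adj_R2 = s$adj.r.squared, s_z = s$sigma)
}))
print(round(scan, 4))
```

$R^2$ never decreases when a term is added, although beyond the second degree the gain is
negligible; the corrected coefficient and $s_z$ do not share this monotonic behaviour,
which is one reason why they are useful when comparing models of different size.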

To conclude this section, we report the connection between the $F$ statistic used to
verify the null hypothesis $H_0 : \theta_1 = \dots = \theta_p = 0$ and the $R^2$ parameter. Comparing
Eqs. (11.105) and (11.107), we get:


$$R^2 = \frac{RSS_0 - RSS}{RSS_0} = \frac{(RSS_0 - RSS)/RSS}{(RSS_0 - RSS)/RSS + 1}
= \frac{pF}{pF + (n-p-1)} ,$$


where $F \sim F(p, n-p-1)$. Since $R^2$ is an increasing function of $F$, the choice
of the functional form having the highest value of $R^2$ is equivalent to selecting the
solution that, when tested against $H_0$, rejects it with the smallest p-value.
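
The relation between $R^2$ and $F$ can be checked directly on the output of `summary()` for
an `lm` fit, whose `fstatistic` component holds the $F$ value together with $p$ and
$n - p - 1$. The data are synthetic (generated as in Eq. (11.109) with an arbitrary seed).

```r
set.seed(2)                       # arbitrary seed (assumption)
x <- 10 + 2 * rnorm(20)
y <- 2 + x^2 + 5 * rnorm(20)

s <- summary(lm(y ~ x + I(x^2)))  # p = 2 predictors

Fval <- s$fstatistic[["value"]]   # F statistic for H0: theta_1 = theta_2 = 0
p    <- s$fstatistic[["numdf"]]   # p
dof  <- s$fstatistic[["dendf"]]   # n - p - 1

R2_from_F <- p * Fval / (p * Fval + dof)
print(c(R2_from_F = R2_from_F, R2 = s$r.squared))
```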


11.10 Fit Strategies 


At the end of this long discussion on model fitting and testing, we try to give you 
some further guidance on the practical procedures to follow. 

We start by recalling that in the previous sections we have always considered the 
case of Gaussian errors. 

If the errors are reliable but the observed variables are not Gaussian, high
$\chi^2$ values are usually obtained because the points are often more scattered than
expected from normal deviates. In this case, $\chi^2$ values corresponding to significance
levels which are small from a purely statistical point of view (even at the per-thousand
level or less) are sometimes accepted. This choice, i.e. to attribute the high $\chi^2$
value to the non-Gaussianity of the data rather than to the model function, can
sometimes be justified from a practical point of view.

An approximate but immediate view of the result quality can also be obtained by
eye before carrying out the $\chi^2$ test: if the experimental points “touch” the curve
$y = f(x, \hat\theta)$ within one, two or three error bars according to the $3\sigma$ law, then the $\chi^2$
will probably be acceptable. Figure 11.11 exemplifies the situation; remember that,
by convention, the error bar is always equal to $\pm\sigma$.

Most minimization codes, if the errors are not specified, tacitly assume that they
are all the same and that $\sigma_z = 1$, without warning the users of the risks they are
taking. If we can actually assume $\sigma_i^2 = \sigma_z^2$ for each $i$ without knowing $\sigma_z$, we will
then have to provide its estimate $s_z$ through Eq. (11.63) to be able to calculate the
errors on the fit results. This process is known as error readjustment or rescaling.
We recall that this procedure is valid only if there are no doubts about either the error
constancy or the functional model $\mu_i(x_i, \theta) = x_i^\dagger \theta$ used. Furthermore, it makes


Fig. 11.11 Bad fit (a) and a good fit (b); in the second case, about 68% of points
“touch” the regression curve within one error bar


Fig. 11.12 When data fluctuate around a parabola with constant errors as in (a), a fit assuming a
straight-line model and using the error rescaling will find a good $\chi^2$ but with an incorrect error
overestimation as in (b). In (c) a linear weighted fit of points with different errors is shown, giving
as the result the full line. The solution of an unweighted fit, which assumes equal errors, assigns
the same relevance to all points, thus giving an incorrect result, given by the dashed line


no sense to perform the $\chi^2$ test at the end of the fit, because, after replacing $\sigma_z^2$ with
$s_z^2$ in the $\chi^2(\hat\theta)$ formula, we will always get $\chi^2(\hat\theta) = (n - p - 1)$ (a constant!)
for any functional model. Without these precautions, we could mistakenly evaluate
as correct a fit that is not, since the expected value $(n - p - 1)$ of $\chi^2(n - p - 1)$
is well below the critical value. The problem is visualized in Fig. 11.12, where
we assume the data to be parabolic and with a constant error, as in Fig. 11.12a.
A straight-line fit with error adjustment will give the result in Fig. 11.12b, where
the final $\chi^2$ is obviously good, but only because errors larger than the real ones
have been estimated. The correct result is only obtained by representing $f(x, \theta)$ as a
parabola $\theta_0 + \theta_1 x + \theta_2 x^2$, which, of course, presupposes the a priori knowledge of the
functional form of the model. The $\chi^2$ test to check the validity of the model function
must therefore be performed only when the absolute errors are known. However, it is
possible, using the $F$ test of Eq. (11.99), to check on the elements of linear models
even when only the relative error ratios are known.
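
The following sketch reproduces the situation numerically. The data are synthetic (an
assumption for illustration): a parabola with constant error $\sigma = 1$ is fitted with a
straight line, first using the known absolute error and then the rescaled error $s_z$
estimated from the fit itself. The helper name `chi2_of` is hypothetical.

```r
set.seed(3)                                   # arbitrary seed
x <- seq(1, 10, length.out = 15)
sigma_true <- 1                               # known, constant absolute error
y <- 1 + 0.5 * x^2 + rnorm(length(x), sd = sigma_true)   # parabolic data

# Hypothetical helper: chi-square of a fit for a given common error
chi2_of <- function(fit, sigma) sum(residuals(fit)^2) / sigma^2

line <- lm(y ~ x)                             # wrong model: straight line
dof  <- df.residual(line)                     # n - p - 1

chi2_abs      <- chi2_of(line, sigma_true)    # with the known absolute error
s_line        <- summary(line)$sigma          # rescaled error from the fit
chi2_rescaled <- chi2_of(line, s_line)        # always equal to n - p - 1

print(c(dof = dof, chi2_abs = chi2_abs, chi2_rescaled = chi2_rescaled,
        s_z = s_line, critical_1pct = qchisq(0.99, dof)))
```

With the absolute error the straight line is clearly rejected, whereas after rescaling
the $\chi^2$ coincides with the number of degrees of freedom and carries no information;
the price is an estimated error $s_z$ well above the true one.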

Finally, another frequent gross mistake is to provide minimization codes with the
data only, omitting the errors even when these vary from point to point; most of the
time this procedure gives wrong results, as shown in Fig. 11.12c.
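
A small simulation shows the effect. In the sketch below (synthetic data, arbitrary seed,
hypothetical helper name `one_run`) the errors grow strongly with $x$; the same straight
line is fitted many times with and without weights, and the spread of the estimated
slopes is compared.

```r
set.seed(4)                        # arbitrary seed
x     <- 1:10
sigma <- 0.2 * x^2                 # errors that vary strongly from point to point

# Hypothetical helper: one simulated data set, fitted without and with weights
one_run <- function() {
  y <- 1 + 2 * x + rnorm(length(x), sd = sigma)
  c(unweighted = unname(coef(lm(y ~ x))[2]),
    weighted   = unname(coef(lm(y ~ x, weights = 1 / sigma^2))[2]))
}

slopes <- replicate(500, one_run())    # 2 x 500 matrix of slope estimates
spread <- apply(slopes, 1, sd)
print(round(rbind(mean = rowMeans(slopes), sd = spread), 3))
```

Both estimators are centred on the true slope, but the unweighted one fluctuates much
more, because the points with large errors are given the same relevance as the precise ones.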


11.11 Nonlinear Least Squares 


As mentioned in the previous section, the model function to be used in the $\chi^2$
minimization can be nonlinear in the parameters. An example of this is given in
Exercises 10.6 and 10.7, where the model introduced in the function to be minimized
(11.2) was represented by the Gaussian and binomial densities, respectively. We
solved these exercises using our routine Nlinfit for nonlinear minimization, and
the results have also been examined. In this section we intend to briefly describe the
algorithms used in this class of codes to evaluate the minimum of the function $\chi^2(\theta)$
in a $p$-dimensional space, without giving too many numerical calculation details,
which are extensively described in other texts such as [BR92] or [PFTW92] and
also mentioned in our web pages [RPP].

The most efficient algorithms are based on the negative gradient method, since
the opposite of the gradient is a vector that always points in the direction in which
the function decreases most rapidly. Therefore, proceeding by successive steps until
the function starts to increase again, they reach a region where the second
derivative is positive. Around the minimum the $\chi^2$ function is expanded in
parabolic approximation:


$$\chi^2 \simeq \chi^2_0 + \sum_j \frac{\partial\chi^2}{\partial\theta_j}\, \theta_j
+ \frac{1}{2} \sum_{jk} \frac{\partial^2\chi^2}{\partial\theta_j\, \partial\theta_k}\, \theta_j \theta_k . \tag{11.111}$$

To fix ideas, let us assume that the $\chi^2$ function is of the type of Eq. (11.6):


$$\chi^2(\theta) = \sum_i \frac{[y_i - f(x_i, \theta)]^2}{\sigma_i^2} \equiv \sum_i H_i^2(\theta) ,$$


where the index i refers to the measured points; the second derivatives then take the 
form: 


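$$\frac{\partial^2\chi^2}{\partial\theta_j\, \partial\theta_k}
= \frac{\partial}{\partial\theta_j} \sum_i \frac{\partial H_i^2}{\partial\theta_k}
= 2\, \frac{\partial}{\partial\theta_j} \sum_i H_i\, \frac{\partial H_i}{\partial\theta_k}
= 2 \sum_i \left( \frac{\partial H_i}{\partial\theta_j} \frac{\partial H_i}{\partial\theta_k}
+ H_i\, \frac{\partial^2 H_i}{\partial\theta_j\, \partial\theta_k} \right) . \tag{11.112}$$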


Neglecting the second term, the formula: 


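$$\frac{\partial^2\chi^2}{\partial\theta_j\, \partial\theta_k}
\simeq 2 \sum_i \frac{\partial H_i}{\partial\theta_j} \frac{\partial H_i}{\partial\theta_k}
= 2 \sum_i \frac{1}{\sigma_i^2}\, \frac{\partial f_i}{\partial\theta_j} \frac{\partial f_i}{\partial\theta_k} \tag{11.113}$$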


is obtained. Experience has shown that this approximation has many advantages: the 
algorithm is faster, as accurate as the algorithms that use higher-order expansions, 
and the matrix of second derivatives is always positive definite [Jam08, Jam92]. 

It is also important to note that setting to zero the second derivatives of the
function:


$$H_i(\theta) = \frac{y_i - f(x_i, \theta)}{\sigma_i}$$


with respect to the $\theta$ parameters implies that the second derivatives of the function
$f(x, \theta)$ vanish. This means a linear first-order Taylor expansion of $f$ around the
starting point $\theta^*$:


$$f(x, \theta) = f(x, \theta^*) + \sum_k \left. \frac{\partial f}{\partial\theta_k} \right|_{\theta^*} \Delta\theta_k , \tag{11.114}$$

where $\Delta\theta = (\theta - \theta^*)$. In this case Eq. (11.111) is an exact relation, since
the $\chi^2$ derivatives of higher order than the second are zero. If we now keep in
mind the formulae described in Sect. 11.5 and perform, in Eqs. (11.55–11.58), the
substitutions:


$$\theta_k \to \Delta\theta_k = \theta_k - \theta_k^* , \qquad
y_i \to y_i - f(x_i, \theta^*) , \qquad
x_{ik} \to \left. \frac{\partial f(x_i, \theta)}{\partial\theta_k} \right|_{\theta^*} ,$$


we see that the minimization equations are, step by step, identical to Eqs. (11.57–11.60).

The $X$ matrix of Eq. (11.54) now contains the $p$ first derivatives of $f(x, \theta)$
calculated at the point $\theta^*$ and at the $n$ experimental points $x_i$. The matrix $(X^\dagger W X)$
of Eq. (11.59), which in this case is called the curvature matrix, contains the $\chi^2$ second
derivatives in the form of products between the first derivatives of $f$ and the
weights $1/\sigma_i^2$. Once the minimum with respect to the $\Delta\theta_k$ values is found, the
procedure restarts from the new point $\theta$. At the minimum point, the errors on the
parameters are obtained through the error matrix (11.62), as in the linear case. For
more details, you can examine the Nlinfit routine and its comment lines.
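
The iteration just described can be sketched in a few lines of R. The block below is
not the Nlinfit code: it is a minimal illustration of the linearized step, under the
assumptions of a hypothetical exponential model $f(x, \theta) = \theta_1 e^{-\theta_2 x}$, synthetic data
with a known constant error and analytically coded first derivatives. The names `model`
and `jacobian` are introduced here for the example only.

```r
# Hypothetical model and its first derivatives with respect to the parameters
model    <- function(x, th) th[1] * exp(-th[2] * x)
jacobian <- function(x, th) cbind(exp(-th[2] * x),
                                  -th[1] * x * exp(-th[2] * x))

set.seed(5)                                   # arbitrary seed
x     <- seq(0, 5, length.out = 20)
sigma <- rep(0.1, length(x))                  # known absolute errors
y     <- model(x, c(5, 0.5)) + rnorm(length(x), sd = sigma)  # synthetic data

W     <- diag(1 / sigma^2)                    # weight matrix
theta <- c(4, 0.4)                            # starting point theta*

for (iter in 1:50) {
  X     <- jacobian(x, theta)                 # first derivatives at theta*
  r     <- y - model(x, theta)                # y_i - f(x_i, theta*)
  curv  <- t(X) %*% W %*% X                   # curvature matrix
  delta <- solve(curv, t(X) %*% W %*% r)      # linear LS solution for Delta theta
  theta <- theta + as.vector(delta)           # restart from the new point
  if (max(abs(delta)) < 1e-10) break
}

# Parameter errors from the error matrix at the minimum
X      <- jacobian(x, theta)
errors <- sqrt(diag(solve(t(X) %*% W %*% X)))
print(rbind(estimate = theta, error = errors))
```

No step-size control is included, so a sketch of this kind can fail when the starting
point is far from the minimum; production routines add safeguards for that case. The
result can be compared with that of the standard R function `nls()` applied to the same data.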

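When the $\chi^2$ to be minimized is a likelihood function, from Eq. (10.45) we have:


$$\chi^2(\theta) = -2 \ln L(\theta) + C = -2 \sum_i \ln f(x_i, \theta) + C . \tag{11.115}$$


The repetition of the previous calculation shows that, after the linearization of the
function $f$, the second derivative of $\chi^2$ is:


$$\frac{\partial^2\chi^2}{\partial\theta_j\, \partial\theta_k}
= 2 \sum_i \frac{1}{f_i^2}\, \frac{\partial f_i}{\partial\theta_j} \frac{\partial f_i}{\partial\theta_k} . \tag{11.116}$$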

Also in this case, it can be shown that the second derivative matrix is always positive 
definite [Jam08, Jam92]. 
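
The intermediate step between Eqs. (11.115) and (11.116) can be spelled out as follows;
this is a sketch of the algebra, with $f_i \equiv f(x_i, \theta)$ used as shorthand:

```latex
\frac{\partial \chi^2}{\partial \theta_k}
  = -2 \sum_i \frac{1}{f_i}\,\frac{\partial f_i}{\partial \theta_k} ,
\qquad
\frac{\partial^2 \chi^2}{\partial \theta_j\,\partial \theta_k}
  = 2 \sum_i \left[
      \frac{1}{f_i^2}\,\frac{\partial f_i}{\partial \theta_j}\,\frac{\partial f_i}{\partial \theta_k}
      - \frac{1}{f_i}\,\frac{\partial^2 f_i}{\partial \theta_j\,\partial \theta_k}
    \right] .
```

Linearizing $f$ removes the term containing its second derivatives, leaving a sum of
products of first derivatives with the positive weights $1/f_i^2$, which is the structure
of Eq. (11.116).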


11.12 Problems 


11.1 Find the linear correlation coefficient (4.31) when f(X) = a+ bX and X and Z 
are independent. 


11.2 Two quantities X and Y, linked by a linear dependence, are measured with an
instrument having an accuracy of 10%:


x     10     20     30     40     50     60     70
y   21.4   38.8   52.2   88.1   99.5  120.4  158.3


Determine the regression line using the routines Linemq and FitPolin, assuming the
accuracy to be a uniformly distributed interval [x − 0.05 x, x + 0.05 x] (and similarly for
y) that contains the true value with CL = 100%. Calculate the $\chi^2$ value and comment
on the result.


11.3 Two measured quantities X and Y have Gaussian distributed errors:


x        10     20     30     40     50     60     70
$s_x$   0.3    0.6    0.9    1.2    1.4    1.7    2.0
y      20.5   40.0   63.6   86.7  104.3  123.3  144.7
$s_y$   0.6    1.1    1.9    2.5    3.1    3.5    4.1


Determine the regression line with the routines Linemq and FitPolin. Calculate the
$\chi^2$ value and comment on the result.


11.4 The vertex problem: using the least squares method, determine the common vertex 
(xo, yo) of a set of straight lines y = a; + b;x. (Hint: consider the equation of a pencil 
of straight lines passing through a point: (y — yo)/(x — xo) = b). 


11.5 A Gaussian variable Y, measured as a function of X, provided the values:


x    1.0   1.1   1.4   1.5   1.8   2.0   2.2   2.3   2.4   2.5
y   5.16  5.96  6.29  7.41  7.31  8.26  9.15  9.51  9.96  9.03


Knowing that Y has a constant standard deviation, determine if a first- or second-degree
polynomial relation between X and Y is statistically compatible with the data. The
FitPolin routine can be used.




11.6 The sampling of two Gaussian variables X and Y provided the result: 


x 1.271 0.697 2.568 2.400 2.879 2.465 2.472 2.039 2.277 1.392 
y 6.05 3.57 13.88 13.77 15.77 13.61 13.86 11.30 12.77 6.38 


Determine the correlation function between these variables using the FitPolin routine
with a first- or second-degree polynomial function.


11.7 The result of five measurements $y_i$ as a function of exact values $x_i$ is:


x     2     4     6     8    10
y   7.9  11.9  17.0  25.5  23.8


The Y values have a Gaussian relative error equal to ±10%. Determine the coefficients
of the functional dependence Y = a + bX. Analyse the obtained result by simulating
the fit procedure 20,000 times.


11.8 The following five measurements $y_i$ are given as a function of $x_i$ values:


x   1.85   3.77   5.74   7.56   9.66
y   8.87  13.90  17.70  22.91  23.59


The values of X and Y are affected by uniform relative fluctuations of ±10%. Determine
the a and b coefficients of the functional dependence Y = a + bX. Analyse the obtained
result by simulating the fit procedure 20,000 times.


Chapter 12
Experimental Data Analysis


You see, it depended on one or two points at the very edge of the 
range of the data, and there’s a principle that a point on the end 
of the range of the data -the last point- isn’t very good, because, 
if it was, they’d have another point further along 


Richard P. Feynman, “SURELY YOU’RE JOKING, 
MR. FEYNMAN!: ADVENTURES OF A CURIOUS 
CHARACTER”. 


12.1 Introduction 


The technical and more extensive part of this chapter describes how to apply 
statistical and probabilistic methods to the various types of measurements and 
experiments that are usually carried out in a scientific laboratory. 

Our main purpose is to enable the researcher or the experimenter to recognize 
the type of measurement he/she is carrying out and to properly evaluate its accuracy 
and precision. Here, we will not deal with the important problem of finding physical 
laws through best-fit techniques, because this crucial aspect has been extensively 
treated in the previous Chap. 11. 

Together with the technical topics, we will also address some very important 
conceptual and methodological aspects directly related to the foundations of the 
scientific method that permitted the birth and the development of modern science. 
This method is based on the observation of natural phenomena, i.e. on the data 
collection and analysis, according to those principles and procedures that were 
systematically adopted for the first time by Galileo Galilei and which have then 
consolidated and improved over the last four centuries. Mathematical and statistical 
analysis of data play, in this context, a role of primary importance. 

Today, disciplines such as physics and medicine are considered sciences, as they 
are based on experimental facts, while this is no longer true for astrology, because 
the latter is based on people’s expectations and is totally disproved by the facts, 
when these are analysed, as in Exercise 3.17, with the scientific method. 


© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
A. Rotondi et al., Probability, Statistics and Simulation, La Matematica per il 3+2
139, https://doi.org/10.1007/978-3-031-09429-3_12




The distinction between the sciences in a broad sense, such as medicine, and the 
so-called hard (or exact) sciences, such as physics, is more subtle. More correctly, 
the distinction should be made between totally and partially quantitative sciences. 
An adequate definition of hard science, if we really want to use this term, could be 
the following: 


Statement 12.1 (Hard Science) A science is said to be hard (or exact) when it is 
always able to associate an error (uncertainty) with its predictions and results. 


Basically, the term “exact” doesn’t imply that the results must be affected by zero 
or negligible error; it is instead synonymous with “quantitative”, which indicates 
a method providing results that are predictable with certainty or with reliable 
confidence levels. As you already know, the calculation of the measurement error 
or uncertainty is of fundamental importance for the correct determination of the 
confidence levels. 

From the theoretical point of view, errors are sometimes present when simplified 
models of the phenomenon under study are used or when calculations are carried 
out with approximate numerical methods; however, we will not delve here into these 
particular aspects. 

In the following, we will then only consider the point of view that the exper- 
imental errors derive from the random or systematic fluctuations or uncertainties 
connected with the measurement operations. Often the most difficult and laborious 
part of an experiment is precisely the evaluation of errors. In this phase, the 
experimenter is not guided so much by theorems or precise rules but rather by 
experience and a series of assumptions, sometimes even a little arbitrary. However, 
there is an important constraint: these assumptions and rules of thumb must always 
be in accordance with the fundamental laws of probability and statistics. We also 
note that some of the subjective and arbitrary a priori assumptions we will make are 
about the shape of the statistical distributions to be associated with the behaviour 
of the instruments or to the measurement operations that are performed. For an in- 
depth analysis of the consequences linked to these choices, you can read [D’A99]. 

Finally, remember that in this chapter we will use the notation (6.16) for 
confidence intervals and that the meaning of the confidence levels associated with 
these intervals is the frequentist one widely discussed in Sect. 6.2. These are the 
conventions currently used in the international scientific literature for the results of 
laboratory experiments. 


12.2 Terminology 


The measurement of a physical quantity with experimental equipment can
be sketched as in Fig. 12.1. As we can see, the uncertainties characterizing the
measurement can be referred to the measured physical quantity, to the instrument
or to the interaction between object and instrument. There is currently no single
terminology for describing these uncertainties. The ISO recommendations [fSI93]




[Figure: Physical object (constant or variable values) <-> Measurement (statistical
errors, systematic errors) <-> Instrument (sensitivity, accuracy)]

Fig. 12.1 Sketch of measurement operations


are to classify them into two types: those that can be treated only with statistical 
methods and those (called systematic or systematic effects) that must be treated 
with other methods. The current nomenclature normally uses the following terms: 


$$\text{uncertainty} \begin{cases}
\text{statistical uncertainty, statistical error, random error} \\
\text{systematic uncertainty, systematic error, systematic effect}
\end{cases}$$


The recommendation of [fSI93] is to use the terms statistical uncertainty and
systematic effect. We believe that it is more appropriate to further distinguish
between systematic effect and systematic error. The systematic effect must be known
and corrected before the data analysis, and the systematic error is the uncertainty
that remains after the systematic effect has been removed. For example, if the data
depend on atmospheric pressure, and one has daily average pressure values for a
nearby location, the data can be corrected day by day with these values. One can expect
that during the day there are small variations around the average pressure values
that have been used: this uncertainty must then be added to the other uncertainties
of the measurement as a systematic error. We make the following choice, which we
will stick to throughout the chapter:


$$\text{uncertainty} \; \begin{cases} \text{statistical uncertainty or statistical error} \\ \text{systematic uncertainty or systematic error} \end{cases} \qquad (12.1)$$


We will now examine in detail all the cases that may happen. 


12.3 Constant and Variable Physical Quantities 


When starting an experiment, first of all it is necessary to check whether the quantity
to be measured is a constant or a random variable. 

In the first case, if the fluctuations are observed during the experiment (different 
results at each measurement while keeping the experimental conditions constant), 
they are to be attributed to the measurement operations or to the behaviour of the 




instruments used. In the second case, the observed fluctuations will also include 
those of the measured object. This component of the fluctuations contains the 
physical information about the statistical law (2.6) of the quantity being measured. 
The situation can be summarized in the two operational definitions: 


Statement 12.2 (Constant Physical Quantity) A physical quantity is called con- 
stant when it is a universal physical constant, i.e. a quantity that has the same value 
in all reference systems (Planck’s constant, electron charge, rest mass of a stable 
particle, ...) or a quantity which can reasonably be considered constant and stable 
with respect to the measurement that is being carried out. 


Statement 12.3 (Variable Physical Quantity) A physical quantity is said to be 
variable when it has measurable fluctuations and variations that are intrinsic to the 
physical process being studied. Very often the fluctuations are purely statistical, and 
then the quantity is a random variable that has a specific distribution. The purpose 
of the measurement is precisely the determination of this distribution. 


Some examples of random physical quantities are: 


— The speed of a gas molecule ($\sqrt{\chi^2}$ density with three degrees of freedom, also
known as the Maxwell density; see Exercise 3.10)

— The number of cosmic rays per second (approximately 100) that pass through 
your body as you are reading this page (Poisson law) 

— The number of electrons passing through the cross section of a conductor in a 
given time interval 

— In general, all the quantities studied in mechanics and statistical physics 


12.4 Instrumental Sensitivity and Accuracy 


The behaviour of an instrument is defined by two important characteristics: 
sensitivity (also called resolution) and accuracy. 

The sensitivity denotes the smallest change in the measured variable to which the 
instrument responds. 


Statement 12.4 (Instrumental Sensitivity) If an instrument provides the value x
in the measurement of a physical quantity x, the sensitivity interval or resolution is
indicated by $\Delta x$, that is, the minimum quantity necessary to move the result of the
measurement from the value x to a contiguous one. The sensitivity is defined as the
ratio:


$$S = \frac{1}{\Delta x}$$


In an ideal instrument, with high sensitivity, $\Delta x \simeq 0$ and $S \gg 1$. If $\Delta y$ is the read-out
variation for a change $\Delta x$ of the measured quantity, the sensitivity can be defined as




[Figure: (a) digital read-out 2.15; (b) analogue scale, read-out 2.25 ± 0.05]


Fig. 12.2 (a) In digital instruments the sensitivity range becomes a rounding error. (b) In analogue 
instruments, the sensitivity interval is given by the width of the minimum read-out interval defined 
by the scale. The measured value is assumed as the midpoint of the interval indicated by the pointer 


$S = \Delta y/\Delta x$. In digital instruments, the sensitivity interval can be clearly defined,
since in this case it is nothing more than a rounding error. For example, if a well-made
digital multimeter indicates a voltage of 2.15 V, the true value will be between
2.145 and 2.155 V (see Fig. 12.2a), since it is reasonable to assume that the rounding
operations are carried out correctly. In this situation, the sensitivity range is $\Delta x$ =
(2.155 − 2.145) V = 10 mV. Since there is generally no correlation between the
sensitivity range and the location of the true value within this range, we can
say that, if x is the experimental value and $\Delta x$ is the sensitivity interval, the true
value will be within the interval $x \pm \Delta x/2$, with uniform probability law and with
a confidence level of 100%. In summary, the true value is assumed to be uniformly
distributed within the range:


$$\text{true value} = x \pm \frac{\Delta x}{2} \quad (CL = 100\%) . \qquad (12.2)$$


In the case of Fig. 12.2a, one can affirm that the true value is within the interval: 


2.150 ± 0.005 V ,


that the sensitivity range is 10 mV and that the sensitivity is $S = 100\ \mathrm{V}^{-1}$. This
means that 1 V shifts the reading by 100 positions.
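
The uniform law of Eq. (12.2) can be checked numerically with a short R sketch. The true voltages, the seed and all variable names below are illustrative assumptions and do not come from the text; the only inputs taken from the example are the 10 mV step and the resulting sensitivity.

```r
# Sketch: rounding error of a digital instrument with step dx = 10 mV.
set.seed(1)                       # arbitrary seed, for reproducibility
dx    <- 0.01                     # sensitivity interval in V (10 mV)
truth <- runif(1e5, 2.0, 2.3)     # hypothetical true voltages in V
shown <- round(truth / dx) * dx   # value displayed by the instrument
err   <- truth - shown            # true value minus displayed value

max_abs_err <- max(abs(err))      # stays within dx/2, as in Eq. (12.2)
sd_err      <- sd(err)            # compare with dx/sqrt(12) for a uniform law
S           <- 1 / dx             # sensitivity in V^-1
print(c(max_abs_err = max_abs_err, sd_err = sd_err,
        expected_sd = dx / sqrt(12), S = S))
```

If the rounding is done correctly, the largest deviation should stay within half a step and the standard deviation of the deviations should be close to the uniform value $\Delta x/\sqrt{12}$.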

For analogue instruments, the sensitivity range has a less clear definition: a 
pointer (needle, small arrow, ...) moves continuously, but the thickness of the 
pointer itself and the distance between two adjacent marks, present on the graduated 
dial of the instrument, define a minimum read-out interval below which it makes no 
sense to proceed. Generally, the result is still reported in the form (12.2), where the 
measured value is assigned to the midpoint of the interval indicated by the pointer 
and the error is the width of the minimum interval, centred on the read value (see 
Fig. 12.2b). Often there is a tendency to interpolate the reading by eye and thus to 
reduce the error. The procedure is acceptable, but in this case, one must be aware 
that a subjective (finer) scale is used instead of the scale of the instrument, obtained 
by visually interpolating between the notches marked on the dial. In this case it is 
possible to associate with the interval (12.2) not the uniform distribution but the 
triangular one (5.35), centred on the read value and with a width equal to the 




minimum “ideal” read-out interval evaluated by eye. Instead of visual interpolation, 
it is however better in these cases to use a digital instrument with a higher sensitivity. 

In addition to sensitivity, the other fundamental parameter characterizing an
instrument is accuracy. The accuracy error is a non-random deviation between the
measured value and the true one and usually depends on the uncertainty on the
correction to be applied to remove the systematic effect. In the following, this
uncertainty will be denoted by $\delta$.

We now come to an important point: how do errors due to sensitivity and accuracy
combine in an instrument? If the digital multimeter sketched in Fig. 12.2a had a perfect
calibration, that is, if $\delta \ll \Delta x$, the true value would certainly be located within
the interval (12.2). If instead there was a calibration defect (accuracy error), let
us say of 30 mV, then $\delta \gg \Delta x$ and the interval (12.2) would be meaningless.
Generally, professional and well-manufactured scientific instruments that are in
good operational conditions have an accuracy range always lower than or at most
of the same order as the sensitivity range. These instruments are equipped with an
accuracy table, where the rules for defining a global interval $\Delta$ are given. This table
allows you to combine sensitivity and accuracy errors into an interval for which
Eq. (12.2) is valid. In this case $\Delta$ indicates a global interval:


$$\delta + \Delta x \simeq \Delta(\text{syst}) , \qquad (12.3)$$


which is called instrumental or systematic error. 

If there are only sensitivity errors, it is reasonable to assume that the true value is
equally likely to be located anywhere within the interval (12.2); if instead there is also an accuracy
component, this is strictly speaking no longer true, because the calibration defect 
generally causes a constant and correlated deviation between the true value and the 
measured one. However, in the absence of more detailed information, the systematic 
error (12.3) is generally associated with a uniform density. At the international level, 
the instruments are divided into accuracy classes, defined on the basis of the relative 
systematic error: 


Classes of accuracy
CLASS                            | 0.2   | 0.5   | 1   | 1.5   | 2.5
$\pm\Delta(\text{syst})/x_{fs}$  | 0.2 % | 0.5 % | 1 % | 1.5 % | 2.5 %


where $x_{fs}$ is the instrument full scale. For example, an instrument is defined to be
of class 1 if its total systematic error does not exceed ±1% of the full scale reading.
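
As a small worked example, the R sketch below turns an accuracy class into an absolute systematic error. The full-scale value and the reading are hypothetical, and the class bound is assumed here to be the half-width of a uniform interval centred on the reading; an instrument's own accuracy table may prescribe a different rule.

```r
# Sketch: systematic error implied by an accuracy class (assumed inputs).
class_pct  <- 1        # accuracy class 1, i.e. 1% of the full scale
full_scale <- 10       # hypothetical full-scale reading in V
reading    <- 2.15     # hypothetical measured value in V

delta_syst <- class_pct / 100 * full_scale  # bound on the systematic error, in V
sigma_syst <- delta_syst / sqrt(3)          # std. dev. of a uniform law on [-delta, +delta]
rel_err    <- delta_syst / reading          # relative error on this particular reading
print(c(delta_syst = delta_syst, sigma_syst = sigma_syst, rel_err = rel_err))
```

Because the bound is fixed by the full scale, the relative error grows as the reading moves towards the low end of the range, which is why measurements are best taken close to full scale.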


12.5 Measurement Uncertainty 


We go on with the study of the diagram of Fig. 12.1 and describe the analysis of the 
measurement operations. 

In this process, which involves the interaction between the whole experimental 
apparatus (which can also include the observer) and the quantities that are being 
measured, two types of errors occur, statistical and systematic. 

The statistical errors have been extensively discussed in Chap. 6; in laboratory 
measurements they occur when the stabilization of the experimental operations 
or the measurement operations themselves become critical due to the very high 
sensitivity of the experimental instruments. If you measure the length of a workshop
bench with a carpenter’s tape and repeat the measurement several times, you always
obtain the same value, and there are no statistical errors. If, on the other hand, highly
sensitive optical instruments (such as laser distance meters) are used, the superimposition
of many different fluctuations (in positioning, calibration or other) means
that a slightly different value is obtained at each measurement. In this case, we have
a spectrum of experimental results, and the measured value of the bench length becomes a
random variable. If the fluctuations inherent to the measurement process are numerous,
add up linearly and none of them outweighs the others, then the conditions of
the Central Limit Theorem 3.1 hold, and the measurements tend to be Gaussian
distributed. When only statistical errors are present, it is usually assumed that the
average of the measurements should tend to the true value (mind you: this is just an
assumption!). The statistical error, which, as we know, is the estimate of the standard
deviation of the distribution of the measurements, defines the precision of the measurement.

As discussed above, systematic effects are instead due to
incorrect operations or wrong assumptions about the physical model on which the 
measurement is based (e.g. describing the large oscillations of the pendulum with 
a linear model). Consequently, a non-random deviation is introduced between the 
measured value and the true one regardless of the number of observations made. 

In principle, all possible sources of systematic effects must be eliminated before 
starting the measurement. If this is not possible, at the end of the measurement, 
but before carrying out any statistical data analysis, systematic corrections need to be
applied to measured values, using equations based on physical models or other 
methods. If, for example, at the end of a measurement we realize that the response 
of an instrument drifted due to temperature effects, we will have to quantitatively 
study this behaviour and elaborate equations to correct all the observed data, or we 
will have to repeat the measurement by thermally stabilizing the whole apparatus. 
However, even after all possible verifications have been made, an uncertainty about 
the value of these corrections may remain. This uncertainty affects all data in the 
same way, as in the example of the correction due to temperature. Therefore, in 
these cases, it is also necessary to evaluate the systematic error to be associated with 
the obtained results. 

Sometimes, the distinction between statistical and systematic errors is not clear- 
cut. For example, when we read an analogue instrument and correctly try to 




minimize the parallax error, we can obtain a series of different read-outs, the mean 
of which will be more or less coincident, for large samples, with the true value. In 
this case the parallax error is statistical. If, on the other hand, we always read the 
instrument sideways, the average of the measurements will always deviate from the 
true value, giving rise to a systematic error. 

This previous discussion about statistical and systematic errors can be summa- 
rized with some fundamental definitions: 


Statement 12.5 (Statistical Error) It is that kind of error, due to measurement
operations, which causes the result to vary according to a certain statistical 
distribution. The mean of this distribution (true average) is assumed to coincide 
with the true value of the physical quantity being measured. The standard deviation 
of the distribution is the measurement error. These two parameters are estimated 
from the mean and the standard deviation of the experimentally measured sample. 


Statement 12.6 (Precision) The precision is determined by the statistical error 
of the measurement, which is given by the standard deviation estimated from the 
measured sample. A measurement is all the more precise the smaller the statistical 
error is. 


Statement 12.7 (Systematic Error) The systematic effect causes the average of 
the measured values to deviate from the true value, regardless of the number of 
measurements that are made. The systematic error arises from the uncertainties of 
the corrections made to eliminate the systematic effects. 


Statement 12.8 (Accuracy) Accuracy is determined by the systematic errors of the 
measurement. A measurement is all the more accurate the smaller the systematic 
errors are. 


Precision and accuracy can be represented in a schematic, but effective, way by 
representing, as in Fig. 12.3, the experiment as a target whose centre denotes the true 
value of the measurement. The results of the measurements can then be symbolized 
as shots on the target. A measurement that is neither precise nor accurate can be 
represented as a set of scattered points, with a centre of mass (mean) different from 
the true value (the target centre). In a precise, but not very accurate, measurement, 
experimental results are arranged around a value, which however can be very 
different from the true one. An accurate, but not very precise, measurement gives 
a set of points that are considerably dispersed, but with their average close to the 
true value. Finally, a precise and accurate measurement gives narrowly dispersed 
points that are grouped around the true value. It is evident that only an accurate 
measurement (precise or not) is a good measurement, as the mean value of the 
data is a correct estimate of the true mean. In this case, as we will shortly mention 
in Sect. 12.12, the error on the mean can be reduced by increasing the number of 
measures. 

For a single measurement, accuracy can be also defined as the difference between 
the single measured value and the true one, while precision, which is given by the 




[Figure panels: precise but inaccurate; accurate but imprecise; precise and accurate]


Fig. 12.3 Representation of the effects of systematic (accuracy) and statistical (precision) errors 
in a measurement. The true value is represented by the target centre, while the measures are 
represented by the points 


dispersion of repeated measurements, loses its meaning in this case. Therefore, a 
single measurement will be accurate if, in the chosen measurement unit, it is close 
to the true value, not accurate if it is far from it. 

Another good example to understand the difference between accuracy and 
precision is given by the quartz watch: if the watch is set with “the exact time”, 
after some days it will differ slightly from this value, and we will have a precise and 
accurate time measurement. If, on the other hand, we set the clock 5 min ahead of 
the correct time, we will have a precise but not accurate measurement. 

Now suppose that the exact time is unknown or, equivalently, remove the
concentric rings centred on the true value in Fig. 12.3. In this case, we are able to
judge if the measurement is precise, but not if it is accurate; in other words, the
configurations of the top and bottom rows of Fig. 12.3 will look alike. Knowing how
accurate a measurement is would require already knowing the true value, which is
the purpose of the measurement! As all experimenters know, this is the greatest
difficulty encountered in laboratory measurements. It is therefore necessary to know
very well the experimental apparatus and the methods of data processing that are
used, in order to be reasonably certain of having removed the systematic
effects a priori or of knowing how to evaluate them. Later on in this chapter, we will
give some examples and further explore these important aspects.


12.6 Treatment of Systematic Effects 


Although systematic effects can have very different characteristics, it is possible to 
make a fairly general treatment of them, at least for the most common types. 

The first type of effect is due to the discretization operated by digital instruments,
which is related to sensitivity. When $\delta$, which denotes the accuracy error in
Eq. (12.3), is negligible, the instrument sensitivity can be treated statistically,
provided the standard deviation $\sigma$ of the statistical uncertainties is much larger than
$\Delta$. In fact, we can rewrite the measured quantity as:


$$X_k = \mu + R_k + \Delta_k = X'_k + \Delta_k , \qquad (12.4)$$


where $\mu$ is the true value, $R_k$ the random component and $\Delta_k$ the effect due to the
sensitivity. We can assume $\langle R \rangle = 0$, because a value different from zero is included
into the mean $\mu$. Under these assumptions, one has $\langle X' \rangle = \mu$.

In digital instruments, $\Delta_k$ represents the distance, for the kth measurement,
between $\mu + R_k$ and the closest discrete value of the instrumental scale. Unlike
the calibration error, which is independent of k (i.e. of the single measurement), this
error varies for each data point and can be considered as a uniform random variable,
since we supposed that $\sigma \gg \Delta$. This is also the type of error that is introduced
when constructing the histogram of a continuous variable, with the histogram bin
smaller than the range of the data. Therefore, if $\Delta_k$ is similar to a rounding effect,
the approximation $\langle \Delta \rangle \simeq 0$ holds and hence:


$$\langle X \rangle = \langle X' \rangle = \mu , \qquad (12.5)$$
so that the discretization effect of a continuous datum, typical of digital instruments
and histograms, does not alter the average of the measurements. On the contrary, it
has an effect on the dispersion of the measurements. Assuming the uniform distribution
of the systematic effect, we have in fact:


$$\mathrm{Var}[X] = \sigma^2 + \frac{\Delta^2}{12} , \qquad (12.6)$$


where $\Delta$ is the step of the instrument discrete scale or the histogram bin width.
To obtain the dispersion of the data without the instrumental effect, the so-called
Sheppard’s correction is often used:


$$\sigma^2 = \mathrm{Var}[X] - \frac{\Delta^2}{12} . \qquad (12.7)$$


For histograms, the effect of the increased dispersion is shown in Fig. 12.4. We also
recommend solving Problem 12.13.
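
Equations (12.5)–(12.7) can be illustrated with a small simulation. The following R sketch is only a check under assumed values (true mean, $\sigma$, step $\Delta$, sample size and seed are all arbitrary choices, not taken from the text): Gaussian data are rounded to a discrete scale and the variance is then corrected with Sheppard's formula.

```r
# Sketch: effect of discretization on mean and variance, Eqs. (12.5)-(12.7).
set.seed(2)                                   # arbitrary seed
sigma  <- 1.0                                 # assumed statistical std. dev.
Delta  <- 0.5                                 # assumed step of the discrete scale
x_true <- rnorm(2e5, mean = 5, sd = sigma)    # continuous values mu + R_k
x_read <- round(x_true / Delta) * Delta       # values on the discrete scale

mean_shift <- mean(x_read) - mean(x_true)     # Eq. (12.5): should be close to 0
var_read   <- var(x_read)                     # Eq. (12.6): sigma^2 + Delta^2/12
var_corr   <- var_read - Delta^2 / 12         # Eq. (12.7): Sheppard's correction
print(c(mean_shift = mean_shift, var_read = var_read, var_corr = var_corr))
```

With these settings the raw variance should exceed $\sigma^2$ by roughly $\Delta^2/12$, while the corrected value should fall back close to $\sigma^2$, up to sampling fluctuations.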






Fig. 12.4 The histogram of a continuous variable overestimates the dispersion of data when the 
population has a bell-shaped density as shown in the figure. In fact, the abscissa of the midpoint 
of the channel is attributed to all the events contained in the shaded area (which are the majority 
within the bin), even if it has a distance from the mean greater than the average distance of the 
shaded events 


We now come to the second type of systematic effect, called offset or zero-setting
error. In this case, the observed random variable must be written as:


$$X_{ik} = \mu + R_{ik} + \langle S' \rangle + S_i = X'_{ik} + \langle S' \rangle + S_i , \qquad (12.8)$$


where, as before, $\mu$ is the true mean, $R$ is the random fluctuation and $\langle S' \rangle + S_i$ is
the systematic effect, written as an average value $\langle S' \rangle$ plus a random part $S_i$ with
null mean value. The indices denote the kth replicate of the ith measurement, carried
out with the ith instrument or by the ith laboratory. Here the systematic error $S_i$
is the same for all the data of the same measurement or experiment and can be
considered a random variable only if we consider the set of different laboratories or
instruments measuring the same quantity. Before analysing the data, the systematic
effect is corrected by subtracting the $\langle S' \rangle$ value (which must be known) from all the
data:


$$X_{ik} - \langle S' \rangle = \mu + R_{ik} + S_i = X'_{ik} + S_i . \qquad (12.9)$$


In the following this step will be implied, and therefore, without loss of
generality, we will set $X_{ik} - \langle S' \rangle \to X_{ik}$ and $\langle S \rangle = 0$, transforming Eq. (12.8)
into:


$$X_{ik} = \mu + R_{ik} + S_i = X'_{ik} + S_i . \qquad (12.10)$$


The presence of the term $S_i$ creates a correlation among all data of the ith
measurement. Since $\langle R \rangle = 0$ by construction, from Eq. (12.10) it follows that
$\langle X' \rangle = \langle X \rangle = \mu$; the experimental average, after the correction, is therefore a good
estimator of the true mean $\mu$. Since $R_{ik}$ and $S_i$ are independent, from Eq. (12.8) the




variance of this estimate is given by:

$$\mathrm{Var}[X] = \mathrm{Var}[R] + \mathrm{Var}[S] = \sigma^2 + \sigma_{sys}^2 \to \sigma^2 + \frac{\Delta^2}{12} , \qquad (12.11)$$


where $\sigma^2$ is the variance of the non-systematic part. The last relation holds if the
systematic errors are uniformly distributed with a total amplitude $\Delta$, as is often the
case. The validity of this formula can be verified with our routine MCsystems,
which simulates a set of different Gaussian measurements of the same quantity
$\mu$, all carried out by different laboratories and with a uniform offset error. At the
end a parameter called pool, i.e. the standard variable $T_p = (X_{ik} - \mu)/\sqrt{\mathrm{Var}[X]}$,
is calculated. The error handling is correct if $T_p$ follows the standard Gaussian.
The covariance between two measures $X_1$ and $X_2$ must be calculated considering
$X'_1$ and $X'_2$ as independent variables but with the same systematic error. From
Eq. (4.9) one has:


$$\begin{aligned}
\mathrm{Cov}[X_1, X_2] &= \langle (R_1 + S)(R_2 + S) \rangle - \langle R_1 + S \rangle \langle R_2 + S \rangle \\
&= \langle R_1 R_2 \rangle + \langle R_1 S \rangle + \langle S R_2 \rangle + \langle S^2 \rangle - \langle R_1 \rangle \langle R_2 \rangle \\
&\quad - \langle R_1 \rangle \langle S \rangle - \langle R_2 \rangle \langle S \rangle - \langle S \rangle^2 \\
&= \langle S^2 \rangle = \sigma_{sys}^2 ,
\end{aligned} \qquad (12.12)$$


since $R_1$, $R_2$ and $S$ are independent of each other and $\langle R_1 \rangle = \langle R_2 \rangle = 0$. The last
term of the equation implies an average over the values of $S$ in all the different
measurements i, each with constant $S_i$. Based on Eq. (4.31), this result shows that the systematic error
maximizes the correlation among measures.
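
The following R sketch checks Eqs. (12.11) and (12.12) by simulation. It is not the MCsystems routine mentioned above, whose interface is not shown here; it is a minimal stand-in in which each simulated laboratory draws one uniform offset shared by its two measurements. The values of $\mu$, $\sigma$, $\Delta$, the number of laboratories and the seed are assumptions.

```r
# Sketch: variance and covariance with a common uniform offset error.
set.seed(3)                               # arbitrary seed
n_lab <- 50000                            # number of simulated laboratories
mu    <- 10                               # assumed true value
sigma <- 0.3                              # assumed statistical std. dev.
Delta <- 1.0                              # assumed total width of the uniform offset

S  <- runif(n_lab, -Delta / 2, Delta / 2) # one offset S_i per laboratory
X1 <- mu + rnorm(n_lab, 0, sigma) + S     # first measurement of each laboratory
X2 <- mu + rnorm(n_lab, 0, sigma) + S     # second measurement, same offset

var_X <- var(X1)                          # Eq. (12.11): sigma^2 + Delta^2/12
cov_X <- cov(X1, X2)                      # Eq. (12.12): sigma_sys^2 = Delta^2/12
T_p   <- (X1 - mu) / sqrt(sigma^2 + Delta^2 / 12)  # standard variable of the text
print(c(var_X = var_X, cov_X = cov_X, mean_T = mean(T_p), sd_T = sd(T_p)))
```

If the error budget is right, the standard variable should have mean near 0 and standard deviation near 1; its shape is only approximately Gaussian here, since a uniform component is added to a Gaussian one.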

We can generalize these results through Eq. (5.77). For example, if we have three
statistically independent variables $X'_1$, $X'_2$ and $X'_3$, with variances $\sigma_1^2$, $\sigma_2^2$ and $\sigma_3^2$,
having a common systematic error $\sigma_{sys1}$ and another systematic error $\sigma_{sys2}$ affecting
only the first two variables, the covariance matrix can be written, with obvious
notation, as:


$$V(X) = \begin{pmatrix}
\sigma_1^2 + \sigma_{sys1}^2 + \sigma_{sys2}^2 & \sigma_{sys1}^2 + \sigma_{sys2}^2 & \sigma_{sys1}^2 \\
\sigma_{sys1}^2 + \sigma_{sys2}^2 & \sigma_2^2 + \sigma_{sys1}^2 + \sigma_{sys2}^2 & \sigma_{sys1}^2 \\
\sigma_{sys1}^2 & \sigma_{sys1}^2 & \sigma_3^2 + \sigma_{sys1}^2
\end{pmatrix} . \qquad (12.13)$$


To calculate the variance of the sum or difference of two variables $X_i$, $i = 1, 2$, with
parameters $\mu_i$, $\sigma_i$ and with $X'_1$ and $X'_2$ statistically independent, one can proceed




directly without matrix notation. In fact, from Eqs. (12.11) and (12.12), one has: 


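$$\begin{aligned}
\mathrm{Var}[X_1 + X_2] &= \mathrm{Var}[X_1] + \mathrm{Var}[X_2] + 2\,\mathrm{Cov}[X_1, X_2] \\
&= \sigma_1^2 + \sigma_{sys}^2 + \sigma_2^2 + \sigma_{sys}^2 + 2\sigma_{sys}^2 \\
&= \sigma_1^2 + \sigma_2^2 + 4\sigma_{sys}^2 ,
\end{aligned} \qquad (12.14)$$


$$\begin{aligned}
\mathrm{Var}[X_1 - X_2] &= \mathrm{Var}[X_1] + \mathrm{Var}[X_2] - 2\,\mathrm{Cov}[X_1, X_2] \\
&= \sigma_1^2 + \sigma_2^2 .
\end{aligned} \qquad (12.15)$$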


As is intuitive, the contribution of the constant offset error increases in the sum of
two variables, while it cancels out in their difference.
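
Equations (12.14) and (12.15) can also be checked numerically. In the R sketch below the means, the standard deviations, the Gaussian shape of the common offset and the seed are all assumptions chosen for illustration.

```r
# Sketch: variance of sum and difference with a common offset error.
set.seed(4)                          # arbitrary seed
n     <- 50000                       # number of simulated pairs
s1    <- 0.3                         # assumed statistical std. dev. of X1
s2    <- 0.4                         # assumed statistical std. dev. of X2
s_sys <- 0.5                         # assumed std. dev. of the common offset

S  <- rnorm(n, 0, s_sys)             # common offset (taken Gaussian here)
X1 <- 5 + rnorm(n, 0, s1) + S
X2 <- 3 + rnorm(n, 0, s2) + S

var_sum  <- var(X1 + X2)             # Eq. (12.14): s1^2 + s2^2 + 4 s_sys^2
var_diff <- var(X1 - X2)             # Eq. (12.15): s1^2 + s2^2
print(c(var_sum = var_sum, expected_sum = s1^2 + s2^2 + 4 * s_sys^2,
        var_diff = var_diff, expected_diff = s1^2 + s2^2))
```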

The third and last case we consider is that of systematic errors known as scaling
or normalization errors, with a scale (or multiplier) factor multiplying all the values of the
ith measurement by the same common factor. With the same notation as in
Eq. (12.8), we have:


$$X_{ik} = \langle S' \rangle S_i (\mu + R_{ik}) . \qquad (12.16)$$


Here the correction of the systematic effects is applied by dividing the data by $\langle S' \rangle$.
The analogue of Eq. (12.10) is then:


$$X_{ik} = S_i (\mu + R_{ik}) = S_i X'_{ik} , \qquad (12.17)$$


where the substitutions $X_{ik}/\langle S' \rangle \to X_{ik}$ and $\langle S \rangle = 1$, $\mathrm{Var}[S] = \sigma_{sys}^2$ have been
made. After this correction, the systematic error affects the covariance matrix only.
Since $S$ and $X'$ are independent, one has:

$$\langle X \rangle = \langle S X' \rangle = \langle S \rangle \langle X' \rangle = \langle X' \rangle = \mu , \qquad (12.18)$$


from which we see that, after the correction, the data average is a good estimator
of the true average. From Eqs. (5.69) and (12.16), one obtains the variance of the
estimate:


$$\begin{aligned}
\mathrm{Var}[X] &= \langle \mu + R_{ik} \rangle^2 \sigma_{sys}^2 + \langle S \rangle^2 \sigma^2 + \sigma^2 \sigma_{sys}^2 \\
&= \mu^2 \sigma_{sys}^2 + \sigma^2 + \sigma^2 \sigma_{sys}^2 \simeq \mu^2 \sigma_{sys}^2 + \sigma^2 ,
\end{aligned} \qquad (12.19)$$

where $\sigma^2$ is the variance of the measure after the correction and the last expression,
according to Eq. (5.65), holds under linear approximation. As for Eq. (12.8), this
formula can also be verified with our MCsystemp routine.
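
A numerical check of Eqs. (12.18) and (12.19) is sketched below in R. It does not reproduce the MCsystemp routine; it is an independent simulation in which the scale factor is assumed Gaussian with unit mean, and all numerical values and names are illustrative.

```r
# Sketch: mean and variance of data affected by a scale error.
set.seed(5)                              # arbitrary seed
n     <- 200000                          # number of simulated measurements
mu    <- 10                              # assumed true value
sigma <- 0.5                             # assumed statistical std. dev.
s_sys <- 0.05                            # assumed relative scale error (5%)

S <- rnorm(n, 1, s_sys)                  # scale factor with <S> = 1
X <- S * (mu + rnorm(n, 0, sigma))       # Eq. (12.17)

mean_X    <- mean(X)                                   # Eq. (12.18): close to mu
var_sim   <- var(X)                                    # simulated variance
var_exact <- mu^2 * s_sys^2 + sigma^2 + sigma^2 * s_sys^2  # full Eq. (12.19)
var_lin   <- mu^2 * s_sys^2 + sigma^2                  # linear approximation
print(c(mean_X = mean_X, var_sim = var_sim,
        var_exact = var_exact, var_lin = var_lin))
```

For a scale error of a few percent the term $\sigma^2\sigma_{sys}^2$ is tiny, so the exact and linearized variances are practically indistinguishable.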




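Still within the linear approximation, the covariance between two different
measures $X_1$ and $X_2$, when the two variables $X'_1$ and $X'_2$ are independent, becomes:


$$\begin{aligned}
\mathrm{Cov}[X_1, X_2] &= \langle S X'_1 S X'_2 \rangle - \langle S X'_1 \rangle \langle S X'_2 \rangle \\
&= \langle S^2 \rangle \langle X'_1 \rangle \langle X'_2 \rangle - \langle S \rangle^2 \langle X'_1 \rangle \langle X'_2 \rangle \\
&= \left( \langle S^2 \rangle - \langle S \rangle^2 \right) \langle X'_1 \rangle \langle X'_2 \rangle = \mu_1 \mu_2 \sigma_{sys}^2 .
\end{aligned} \qquad (12.20)$$


Finally, to evaluate the effect of the common scale systematic error on the product or
on the ratio between $X_1$ and $X_2$, we introduce two variables $Z_1 = X_1 X_2$ and $Z_2 =
X_1/X_2$ and apply Eq. (5.77) by defining the transport matrix $T$ and the covariance
matrix $V(X)$ based on Eqs. (12.18) and (12.19) as:


$$T = \begin{pmatrix} \langle X_2 \rangle & \langle X_1 \rangle \\ 1/\langle X_2 \rangle & -\langle X_1 \rangle/\langle X_2 \rangle^2 \end{pmatrix} , \qquad
V(X) = \begin{pmatrix} \langle X_1 \rangle^2 \sigma_{sys}^2 + \sigma_1^2 & \langle X_1 \rangle \langle X_2 \rangle \sigma_{sys}^2 \\ \langle X_1 \rangle \langle X_2 \rangle \sigma_{sys}^2 & \langle X_2 \rangle^2 \sigma_{sys}^2 + \sigma_2^2 \end{pmatrix} . \qquad (12.21)$$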


Now the product $V(Z) = T\,V(X)\,T^\dagger$ must be computed. After a simple but
somewhat lengthy calculation, the following matrix is obtained:


$$V(Z) = \begin{pmatrix}
\mu_2^2 \sigma_1^2 + \mu_1^2 \sigma_2^2 + 4 \mu_1^2 \mu_2^2 \sigma_{sys}^2 & \sigma_1^2 - \dfrac{\mu_1^2}{\mu_2^2} \sigma_2^2 \\
\sigma_1^2 - \dfrac{\mu_1^2}{\mu_2^2} \sigma_2^2 & \dfrac{\sigma_1^2}{\mu_2^2} + \dfrac{\mu_1^2}{\mu_2^4} \sigma_2^2
\end{pmatrix} . \qquad (12.22)$$


The diagonal elements give the variances of the product and of the ratio, and the
off-diagonal elements their covariance. From these results we see that the systematic
error increases with the product, while the ratio does not contain the term $\sigma_{sys}$.
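
The matrix product leading to Eq. (12.22) is easy to verify numerically. The R sketch below builds $T$ and $V(X)$ from Eq. (12.21) for arbitrarily chosen means and errors (all values and names are assumptions) and compares $T\,V(X)\,T^\dagger$ with the closed form.

```r
# Sketch: numerical check of V(Z) = T V(X) T^t against Eq. (12.22).
mu1 <- 4; mu2 <- 2           # assumed means of X1 and X2
s1  <- 0.2; s2 <- 0.1        # assumed statistical std. devs.
s_sys <- 0.05                # assumed common relative scale error

# Covariance matrix V(X) of Eq. (12.21)
V <- matrix(c(mu1^2 * s_sys^2 + s1^2, mu1 * mu2 * s_sys^2,
              mu1 * mu2 * s_sys^2,    mu2^2 * s_sys^2 + s2^2),
            nrow = 2, byrow = TRUE)

# Transport matrix for Z1 = X1*X2 and Z2 = X1/X2, evaluated at the means
Tm <- matrix(c(mu2,      mu1,
               1 / mu2, -mu1 / mu2^2),
             nrow = 2, byrow = TRUE)

VZ <- Tm %*% V %*% t(Tm)     # propagated covariance matrix

# Closed form of Eq. (12.22)
off <- s1^2 - mu1^2 / mu2^2 * s2^2
VZ_closed <- matrix(c(mu2^2 * s1^2 + mu1^2 * s2^2 + 4 * mu1^2 * mu2^2 * s_sys^2, off,
                      off, s1^2 / mu2^2 + mu1^2 * s2^2 / mu2^4),
                    nrow = 2, byrow = TRUE)
print(max(abs(VZ - VZ_closed)))   # difference at rounding-error level
```

Changing `s_sys` leaves the ratio variance and the covariance untouched, which is a direct way to see the cancellation of the scale error.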


12.7 Best Fit with Offset Systematic Errors 


As we have seen, the systematic error introduces a correlation between the 
experimental data, and we must therefore deal with a sample of non-independent 
measures. 

In this case the ML method can still be applied, and the likelihood function 
for correlated variables can be obtained using the product Theorem 1.2 and 
generalizing Eq. (1.21). Therefore, given a set $Y_1, Y_2, \ldots, Y_n$ of correlated measures,




the likelihood function $L_{corr}$ can be written as:


$$L_{corr}(\theta; y) = p(\theta; y_1) \prod_{i=2}^{n} p(\theta; y_i \mid y_{i-1}, \ldots, y_1) . \qquad (12.23)$$


In the following we will deal, for simplicity, only with the case of Gaussian statisti- 
cal errors and Gaussian systematic uncertainties. For a more general discussion, see 
[PS20]. 

To begin with, let us consider the case of offset systematic errors, for which
Eqs. (12.10)–(12.12) hold. Assuming Gaussian errors, Eq. (12.23) becomes nothing
else than the product of a one-dimensional Gaussian density by Gaussian marginal
distributions. Recalling the considerations made in Sect. 4.4, it is easy to conclude
that $L_{corr}$ is simply a multivariate Gaussian function with correlated variables whose
general expression is given by Eq. (4.69).

From Eq. (12.13), the covariance matrix V can be immediately obtained; along 
the diagonal we have the quadratic sums of the statistical and systematic errors, 
while the other terms represent the square of the systematic error: 


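$$V = \begin{pmatrix}
\sigma_1^2 + \sigma_{sys}^2 & \sigma_{sys}^2 & \cdots & \sigma_{sys}^2 \\
\sigma_{sys}^2 & \sigma_2^2 + \sigma_{sys}^2 & \cdots & \sigma_{sys}^2 \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{sys}^2 & \sigma_{sys}^2 & \cdots & \sigma_n^2 + \sigma_{sys}^2
\end{pmatrix} . \qquad (12.24)$$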


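Let us denote with $\Delta$ the vector of the difference between experimental and
theoretical values:


$$\Delta = \begin{pmatrix} \mu_1(\theta) - y_1 \\ \vdots \\ \mu_n(\theta) - y_n \end{pmatrix} . \qquad (12.25)$$

In matrix notation, the function to be minimized becomes:

$$\chi^2(\theta) = \Delta^\dagger V^{-1} \Delta . \qquad (12.26)$$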


As an example, let us consider a linear best-fit procedure, with $\mu_i(\theta_0, \theta_1) =
\theta_0 + \theta_1 x_i$, applied to the points given in Eq. (12.27):


x                  2.0   4.0   6.0   8.0   10.0   12.0
y                  3.0   3.3   4.1   5.7    6.9    6.7     , (12.27)
$\sigma_{stat}$    0.5   0.5   0.5   0.5    0.5    0.5




Fig. 12.5 Fit with both statistical and systematic errors. The solid line is the
result of the best-fit procedure


and also shown in Fig. 12.5. The vertical bars in this figure represent, as usual, the
absolute statistical errors $\sigma_i$ reported in Eq. (12.27). With the dashed lines, we have
instead indicated a common systematic error $\sigma_{sys} = \sigma_i = 0.5$ which has been
linearly added to the statistical ones.

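The solution to the minimization problem can be obtained in a way similar to
that of Sect. 11.4:


$$\hat\theta_0 = \frac{1}{D} \left[ (x^\dagger V^{-1} x)(\mathbf{1}^\dagger V^{-1} y) - (\mathbf{1}^\dagger V^{-1} x)(x^\dagger V^{-1} y) \right] , \qquad (12.28)$$
$$\hat\theta_1 = \frac{1}{D} \left[ (\mathbf{1}^\dagger V^{-1} \mathbf{1})(x^\dagger V^{-1} y) - (\mathbf{1}^\dagger V^{-1} x)(\mathbf{1}^\dagger V^{-1} y) \right] , \qquad (12.29)$$
where $D = (\mathbf{1}^\dagger V^{-1} \mathbf{1})(x^\dagger V^{-1} x) - (\mathbf{1}^\dagger V^{-1} x)^2$ and $\mathbf{1}$ is a column vector (with the
correct dimension) of unit elements.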

If $\sigma_{sys} = 0$ and $\sigma_i^2 = \sigma^2$, then $V = \sigma^2 I$, and we get again exactly Eqs. (11.25)
and (11.26), while, if $\sigma_{sys} = 0$ with different $\sigma_i$, we obtain the weighted LS estimates for the
regression line. In our numerical example, $\sigma_{sys}$ is positive, but $\sigma_i = \sigma = 0.5$,
and this implies that we get the same $\hat\theta_0$ and $\hat\theta_1$ as in the unweighted case. What has just
been stated can be verified in the following way: if $\sigma_i = \sigma$ for any $i = 1, \ldots, n$
and $\sigma_{sys} > 0$, then the element at position ij of $V^{-1}$ can be written as $(\delta_{ij}\, a + b)$,
with $a = 1/\sigma^2$ and:


$$b = -\frac{\sigma_{sys}^2}{\sigma^2 (\sigma^2 + n\, \sigma_{sys}^2)} .$$


From Eqs. (12.28)–(12.29), after a little rearrangement, Eqs. (11.25)–(11.26) are
again obtained. However, the error of the estimates changes: as can also be
intuitively understood, the inclusion of the same constant error common to all points
has no influence on the error of the slope of the fitted line, but only on the
$\theta_0$ parameter, whose estimate is then affected by a larger error than in the pure
statistical case. Applying Eq. (5.73) to the estimators of Eqs. (11.25)–(11.26), that
is, generalizing the error propagation formulae obtained in Sects. 11.4 and 11.5 to
the case of a non-diagonal V matrix, it is easy to verify (see also [Bar89]) that the
error on the parameter $\theta_1$ remains unchanged, while the variance of $\hat\theta_0$ becomes:


$$\mathrm{Var}[\hat\theta_0] = \frac{\sigma^2}{D} S_{xx} + \frac{\sigma_{sys}^2}{D^2} \sum_{i,j} [S_{xx} - S_x x_i][S_{xx} - S_x x_j] . \qquad (12.30)$$


In this equation $D = n S_{xx} - S_x^2$. Since $\sum_i [S_{xx} - S_x x_i] = D$, the double sum equals
$D^2$ and the second term reduces to $\sigma_{sys}^2$, i.e. the variance is equal to the sum of the
variance obtained from the best-fit procedure with only statistical uncertainties and
the additional source of variation due to the systematic error.

In Table 12.1 (second row), the results thus obtained have been reported. They
are compared with those obtained taking into account the random errors only (first
row). When systematic errors are included, from Eq. (12.30) we have that $\mathrm{Var}[\hat\theta_0] =
0.47^2 + 0.5^2$.

The equations described above can be solved with our FitMat routine, which
provides as input to the R routine optim the function (12.26) to be minimized, with
the possibility to have a non-diagonal covariance matrix. The results of Table 12.1
have been obtained with the instructions:


> xx <- c(2,4,6,8,10,12)                      # abscissas of Eq. (12.27)
> y <- c(3.0,3.3,4.1,5.7,6.9,6.7)             # measured values of Eq. (12.27)
> varmat <- matrix(rep(0.25,36),ncol=6)      # fill cov mat with 0.25
> diag(varmat) <- 0.5                         # fill diagonal with 0.25+0.25
> f <- function(par,xx) {par[1]+par[2]*xx}    # straight line to be fitted
> FitMat(xx,y,varmat,parf=c(1,0.5),fun=f)


where the initial values of the parameters to be fitted are contained in the vector 
parf. 
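
For readers without the FitMat routine at hand, the linear case can be cross-checked in base R by solving the normal equations of Eq. (12.26) directly. The sketch below is not the authors' routine: `gls_line` is a hypothetical helper written for this illustration, and it only covers a straight line with a known covariance matrix.

```r
# Sketch: generalized least squares for a straight line, chi2 = D^t V^-1 D.
xx <- c(2, 4, 6, 8, 10, 12)              # abscissas of Eq. (12.27)
y  <- c(3.0, 3.3, 4.1, 5.7, 6.9, 6.7)    # measured values of Eq. (12.27)

gls_line <- function(x, y, V) {          # hypothetical helper, not FitMat
  A    <- cbind(1, x)                    # design matrix for theta0 + theta1 * x
  Vinv <- solve(V)                       # inverse of the data covariance matrix
  covp <- solve(t(A) %*% Vinv %*% A)     # covariance matrix of the estimates
  est  <- covp %*% t(A) %*% Vinv %*% y   # minimum of Eq. (12.26)
  list(theta = as.vector(est), se = as.vector(sqrt(diag(covp))))
}

V_stat <- diag(0.25, 6)                  # statistical errors only, sigma = 0.5
V_sys  <- V_stat + 0.25                  # plus common offset error, sigma_sys = 0.5

fit_stat <- gls_line(xx, y, V_stat)
fit_sys  <- gls_line(xx, y, V_sys)
print(rbind(stat = c(fit_stat$theta, fit_stat$se),
            stat_sys = c(fit_sys$theta, fit_sys$se)))
```

The two fits should return the same estimates and the same slope error, with only the intercept error growing when the offset term is added, in line with the first two rows of Table 12.1 up to rounding.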




Table 12.1 Best-fit results for the data of Fig. 12.5. The result considering statistical errors only 
(first row) is compared with that obtained with an additive constant systematic error (offset error, 
second row) and a constant multiplicative systematic error (scale error, third row) 


Error                    | $\hat\theta_0$ | $\hat\theta_1$ | f
stat.                    | 1.86 ± 0.47    | 0.44 ± 0.06    |
stat.+sys. add. (0.5)    | 1.86 ± 0.69    | 0.44 ± 0.06    |
stat.+sys. mult. (20%)   | 1.86 ± 0.60    | 0.44 ± 0.11    | 1.0 ± 0.2


The application of Eq. (5.73) gives exact results for the linear least squares and,
in general, is valid for “small” systematic errors:

$$ \frac{\sigma_{sys}^2}{\sigma_i^2 + \sigma_{sys}^2} \ll 1 . $$

To solve the general case of large systematic errors in a statistically correct way,
we suggest consulting [HL07, PS20]. Finally, for an in-depth study on mixed linear
models and for the estimation procedures with unknown $\sigma_i^2$ and $\sigma_{sys}^2$, the reference
is still [Dav08].


12.8 Best Fit with Scale Systematic Errors 


As we mentioned earlier, systematic uncertainties can appear not only as offset 
values but also as fractions or percentages of the measured values. This, for example, 
is the case when the number of events that are registered by a detector not having 
100% efficiency has to be multiplied by a correction factor. 

In this situation, described in Eqs. (12.16)–(12.19), it must be considered that $S$
affects not only $y_i$ but also $\sigma_i$, since both these quantities have been obtained by
multiplying the raw data by the same common scale factor. Furthermore, the effect
of the systematic error is now intrinsically nonlinear. For this reason, as shown in
[D’A94], the covariance matrix formalism under linear approximation, applied in
the previous section to additive systematic errors, now leads to biased results even in
the presence of scale errors that are not very large. This situation is, for example, described
in Problem 12.18.

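A way to avoid these difficulties is to introduce, in addition to $\theta$, a further
parameter $\alpha$ in the best-fit function, so as to have $\mu(\theta, \alpha)$. In the present case, we
can then write $\mu(\theta, \alpha) = \alpha\,\mu(\theta)$, since the effect of the systematic multiplicative
errors is to introduce a constant scale factor to all theoretical values. The p.d.f. of
the generic variable $y_i$ can then be written as:

$$ p(y_i; \theta, f) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left[-\frac{1}{2}\frac{(y_i - \alpha\mu_i(\theta))^2}{\sigma_i^2}\right] = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left[-\frac{1}{2}\frac{(f y_i - \mu_i(\theta))^2}{(f\sigma_i)^2}\right] , \qquad (12.31) $$

where $f = 1/\alpha$ is the factor which simultaneously multiplies both $y_i$ and $\sigma_i$ to take
systematic errors into account.

Assuming $f$ to follow the normal distribution $f \sim N(1, \sigma_{sys}^2)$, Eq. (12.23) can
be written as:

$$ L(\theta, f) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left[-\frac{1}{2}\frac{(f y_i - \mu_i(\theta))^2}{(f\sigma_i)^2}\right] \times \frac{1}{\sqrt{2\pi}\,\sigma_{sys}} \exp\left[-\frac{1}{2}\frac{(f-1)^2}{\sigma_{sys}^2}\right] . \qquad (12.32) $$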


Since the multiplicative factors in front of the exponentials have constant values, the
function to be minimized, obtained from the negative logarithm of the likelihood, becomes [D’A94]:

$$ \chi^2(\theta; f) = -2\ln L(\theta, f) \;\Rightarrow\; \sum_{i=1}^{n} \frac{(f y_i - \mu_i(\theta))^2}{(f\sigma_i)^2} + \frac{(f-1)^2}{\sigma_{sys}^2} . \qquad (12.33) $$

Here the standard formula (11.1) is modified by the presence of the term
$(f - 1)^2/\sigma_{sys}^2$, which takes into account the effects due to the systematic multiplicative
uncertainty. By carefully considering Eq. (12.31), one sees that it has a parametrization
different from the Gaussian density that would result from the application of
Eq. (12.17), which would provide the relation $Y = \alpha(\mu(\theta) + R)$. Furthermore, the
$\chi^2$ of Eq. (12.33) will be minimized with respect to $\theta$, a parameter, and $f$, which is
a random variable. This procedure is allowed in the Bayesian approach. For further
information, we refer to [D’A94].

Applying this procedure to the points of Fig. 12.5 and now assuming a 20%
systematic error, we obtain the results reported in Table 12.1 (third row). As in the previous
case, the values of $\hat\theta_0$ and $\hat\theta_1$ do not change, whereas the errors on both these estimates
do change because the absolute value of the systematic error now varies point by
point.

The results shown in the table can be obtained from the FitMat routine
requesting the minimization of $\chi^2$ and providing as input the varmat matrix in
a diagonal form and the systematic error value via the sys variable:


FitMat(xx,y,varmat,parf=c(1,0.5),type='CHIS',sys=0.2)
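
To make the role of the scale factor more concrete, the $\chi^2$ of Eq. (12.33) can also be minimized directly with the R routine optim. The following is only a sketch under stated assumptions: a straight-line model, equal statistical errors, illustrative y values that are not the data of Fig. 12.5, and a hypothetical function name chi2_scale. The errors are taken from the curvature of the $\chi^2$ at the minimum (covariance equal to twice the inverse Hessian).

```r
# Sketch of the chi-square of Eq. (12.33) minimized with optim.
# Assumptions: straight-line model, illustrative y values (not the book data),
# equal statistical errors, sigma_sys = 0.2; chi2_scale is a hypothetical name.
xx  <- c(2, 4, 6, 8, 10, 12)
y   <- c(3.1, 3.2, 4.3, 5.5, 6.4, 6.9)   # illustrative data only
sig <- rep(0.5, length(y))
sig_sys <- 0.2

chi2_scale <- function(p) {
  mu <- p[1] + p[2] * xx                  # theoretical values mu_i(theta)
  f  <- p[3]                              # scale factor
  sum((f * y - mu)^2 / (f * sig)^2) + (f - 1)^2 / sig_sys^2
}

fit <- optim(c(1, 0.5, 1), chi2_scale, method = "BFGS", hessian = TRUE)
cov_par <- 2 * solve(fit$hessian)         # covariance from the chi-square curvature
est <- fit$par
err <- sqrt(diag(cov_par))
print(rbind(estimate = est, error = err))
```

In this sketch the fitted scale factor stays close to 1 with an error close to the assumed 0.2, and the line parameters coincide with those of the ordinary least squares fit, while their errors increase.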


12.9 Indirect Measurements and Error Propagation 


A quantity is said to be measured indirectly when it is a function $z = f(x, y, w, \ldots)$
of one or more directly measured quantities affected by uncertainties. The determination
of the uncertainty on $z$ starting from that of the measured quantities
is called error propagation. In the following we start by describing the simple
case $z = f(x, y)$, which can be easily extended to any number of variables. If
$x$ and $y$ are independent and only affected by statistical errors, Eq. (5.65) or its
generalization (5.72) to $n$ variables should be used to implement this procedure,
after the substitution of the standard deviations with the measurement errors $s_x$ and
$s_y$:

$$ s_f^2 = \left(\frac{\partial f}{\partial x}\right)^2 s_x^2 + \left(\frac{\partial f}{\partial y}\right)^2 s_y^2 . \qquad (12.34) $$

Obviously, with $n$ independent measures, one has:

$$ s_f^2 = \sum_{i=1}^{n} \left(\frac{\partial f}{\partial x_i}\right)^2 s^2(x_i) . \qquad (12.35) $$


This equation, which is the well-known error propagation law, is exact only for
linear transformations. In this case, the resulting standard deviation estimate defines
a Gaussian confidence interval only if all variables are Gaussian, as shown in
Exercise 5.3. However, the intervals tend to be approximately Gaussian even when
a nonlinear function $f$ depends on a large number (more than 5–10) of random variables.
Moreover, as extensively discussed in Chap. 5 and in particular in Sect. 5.4, in
general Eq. (12.35) gives reliable results also in the case of small relative errors.
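
When the derivatives are tedious to compute by hand, Eq. (12.35) can be applied numerically. The following base R sketch estimates the partial derivatives with central finite differences; the function name propagate, the step size and the test function are our own illustrative assumptions, not book routines.

```r
# Numerical version of the error propagation law, Eq. (12.35).
# propagate is a hypothetical helper; derivatives are central differences.
propagate <- function(f, x, s, h = 1e-6) {
  grad <- sapply(seq_along(x), function(i) {
    step <- h * max(abs(x[i]), 1)         # step scaled to the size of x[i]
    xp <- x; xm <- x
    xp[i] <- x[i] + step
    xm[i] <- x[i] - step
    (f(xp) - f(xm)) / (2 * step)          # estimate of df/dx_i
  })
  sqrt(sum(grad^2 * s^2))                 # independent variables assumed
}

# Linear example, for which the law is exact: z = x1 + 2*x2
s_z <- propagate(function(v) v[1] + 2 * v[2], x = c(1, 1), s = c(3, 2))
print(s_z)
```

For the linear function of the example the result is $\sqrt{3^2 + (2 \cdot 2)^2} = 5$, up to rounding of the finite differences.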

We now come to the instrumental uncertainties, to which a uniform density can
often be attributed, as shown in Eq. (12.2). In this case, for example, with two
measures, an error propagation law determined from the first-order Taylor expansion
and with the absolute value of the derivatives is sometimes used:

$$ \Delta_f = \left|\frac{\partial f}{\partial x}\right| \Delta_x + \left|\frac{\partial f}{\partial y}\right| \Delta_y . \qquad (12.36) $$

The generalization for $n$ measures obviously is:

$$ \Delta_f = \sum_{i=1}^{n} \left|\frac{\partial f}{\partial x_i}\right| \Delta_{x_i} . \qquad (12.37) $$

These formulas do not represent the standard deviation of the resulting distribution,
but its total width when $f$ is a linear function. For this reason they are generally
used to estimate the upper error limit, which can be useful in the case of correlated 




quantities with unknown covariances. The absolute value of the derivatives ensures 
that the uncertainty propagation is always calculated in the most “unfavourable” 
way, which corresponds to an increase in the measurement error. 

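If the variables are uncorrelated, then a more correct way to proceed is to
apply Eqs. (12.34) and (12.35) with a uniform density, whose variance, based on
Eq. (3.82), is $\Delta^2/12$. For two and $n$ variables we have, respectively:

$$ s_f^2 = \gamma \left[\left(\frac{\partial f}{\partial x}\right)^2 \Delta_x^2 + \left(\frac{\partial f}{\partial y}\right)^2 \Delta_y^2\right] , \qquad (12.38) $$

$$ s_f^2 = \gamma \sum_{i=1}^{n} \left(\frac{\partial f}{\partial x_i}\right)^2 \Delta_{x_i}^2 , \qquad (12.39) $$

with $\gamma = 1/12$. Therefore, to obtain the total variance, it is necessary to
sum the systematic errors in quadrature and then to divide by 12. The variance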
has been indicated here with Latin letters, as it is still a parameter estimated from 
the data under the a priori uniform density assumption for the systematic effects. 
However, these variances should not be associated with the Gaussian density, even 
if the function f(x, y) combines the variables linearly. In fact, while the linear 
combination of Gaussian errors leads to Gaussian confidence intervals, in this case 
the sum of two or more uniform systematic errors leads to different densities, 
which depend on the number of summed errors. However, from the Central Limit 
Theorem 3.1, we know that these densities rapidly tend to a Gaussian. The practical 
problem is now to evaluate the number of linearly combined measures from which 
it is reasonable to use the Gaussian approximation. The result we will find is quite 
surprising. 

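We solve this problem by treating in a complete way the instructive case of
the sum of two systematic errors of equal value. The result of Exercise 5.2
indicates that the sum of two equal uniform variables defined in $[0, \Delta/2]$ follows
the triangular density in $[0, \Delta]$:

$$ p(x) = \begin{cases} \dfrac{4}{\Delta^2}\, x & \text{for } 0 \le x \le \dfrac{\Delta}{2} \\[2mm] \dfrac{4}{\Delta^2}\,(\Delta - x) & \text{for } \dfrac{\Delta}{2} < x \le \Delta \\[2mm] 0 & \text{otherwise,} \end{cases} \qquad (12.40) $$

with parameters:

$$ \mu = \frac{\Delta}{2} , \qquad \sigma^2 = \frac{\Delta^2}{24} , \qquad \sigma = \frac{\Delta}{\sqrt{24}} . \qquad (12.41) $$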




Fig. 12.6 Triangular density given by the sum of two measures affected by the
same systematic error uniformly distributed in [0, 1]. The shaded area is the
confidence level corresponding to the interval $\mu \pm \sigma$ and is $\simeq 65\%$
(abscissa marks: 0, 1 − 0.408, 1, 1 + 0.408, 2)

This distribution is shown in Fig. 12.6 when $\Delta/2 = 1$. If the two summed systematic
errors are defined on [0, 1], Eq. (12.36) provides the value:

$$ \Delta_f = \Delta_x + \Delta_y = 2 , $$

which is just the total width of the distribution. From Eq. (12.38) we have instead:

$$ s_f = \frac{1}{\sqrt{12}} \sqrt{\Delta_x^2 + \Delta_y^2} = \frac{1}{\sqrt{6}} = 0.408 , \qquad (12.42) $$

which is exactly the same value given by Eq. (12.41), since we are considering the
sum of two errors. The area corresponding to the interval $|x - \mu| \le K\sigma$, with $K$
a real number, can be directly read from Fig. 12.6 or evaluated as:

$$ P\{|X - \mu| \le K\sigma\} = \int_{\mu - K\sigma}^{\mu} x \, dx + \int_{\mu}^{\mu + K\sigma} (2 - x)\, dx = \begin{cases} 0.649 & \text{for } K = 1 \\ 0.966 & \text{for } K = 2 \\ 1 & \text{for } K = 3 \end{cases} \qquad (12.43) $$

where in the last step the value of Eq. (12.42), $\sigma \simeq s_f = 0.408$, was used (for $K\sigma \le 1$
the two integrals sum to $2K\sigma - (K\sigma)^2$, while for $K = 3$ the interval covers the whole range [0, 2]). These
values are very close to the Gaussian probabilities of Eq. (3.35). The surprising
fact is precisely that, given this type of result, experimenters usually assume Gaussian
confidence intervals already when errors are obtained from the linear combination
of only two systematic errors. This result also justifies the assumption of a triangular
density for certain types of systematic errors, which come from experimental tests
or simulations that combine a few systematic errors. In this case, in Eqs. (12.38) and
(12.39), a value $\gamma = 1/24$ should be used, according to Eq. (12.41).
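
The probability levels of the triangular density can be checked with a few lines of R. The sketch below, in which the helper name tri_cover is our own, evaluates the area $2K\sigma - (K\sigma)^2$ of the sum of two variables uniform in [0, 1] (truncated at the edge of the range) and compares it with a simple simulation.

```r
# Coverage of mu +/- K*sigma for the sum of two U(0,1) variables
# (tri_cover is a hypothetical helper name)
sigma <- 1 / sqrt(6)
tri_cover <- function(K) {
  a <- pmin(K * sigma, 1)     # the half-width cannot exceed the support
  2 * a - a^2
}
print(round(tri_cover(1:3), 4))

set.seed(1)
x <- runif(1e5) + runif(1e5)  # simulated sums, with mean 1
mc <- sapply(1:3, function(K) mean(abs(x - 1) <= K * sigma))
print(round(mc, 3))
```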

We now work out the problem of combining statistical and systematic errors. In
the simplest case given by the linear superposition of two independent measures,
one with Gaussian statistical error and the other with uniform systematic error
within an interval $\Delta$, the probability density of the result can be easily obtained
from formula (5.34) derived in Exercise 5.1.


Table 12.2 Confidence levels (in %) of the 1, 2, 3$\sigma_m$ intervals for measurements where statistical Gaussian
errors and systematic uniform errors are linearly combined. The intervals are parametrized as a
function of the ratio $\Delta/\sigma$ and of the standard deviation $\sigma_m = \sqrt{\sigma^2 + \Delta^2/12}$, where $\sigma$ is the
statistical error and $\Delta$ is the total range of the systematic error


$\Delta/\sigma$    $\pm\sigma_m$    $\pm 2\sigma_m$    $\pm 3\sigma_m$

  1.0     68.3     95.4     99.7
  3.0     67.1     95.8     99.8
  5.0     64.9     96.7    100.0
 10.0     60.9     98.6    100.0
100.0     57.8    100.0    100.0


If we consider the systematic error within an interval $(a = -\Delta/2,\ b = +\Delta/2)$,
and introduce the standard variable $t = (z - \mu)/\sigma$, where $\sigma$ is in this case the true
statistical error, from Eqs. (3.40) and (3.44) the formula (5.34) becomes:

$$ p(t) = \frac{\sigma}{\Delta} \left[ \Phi\left(t + \frac{\Delta}{2\sigma}\right) - \Phi\left(t - \frac{\Delta}{2\sigma}\right) \right] . \qquad (12.44) $$


This density is an even function with respect to the origin, since it is the convolution
of two even functions, and has a standard deviation, in the units of $z$, given by:

$$ \sigma_m = \sqrt{\sigma^2 + \Delta^2/12} . \qquad (12.45) $$

The corresponding probability levels can be evaluated by integrating Eq. (12.44).
Table 12.2 shows the results obtained with our routine Stasys(t,sigma,delta),
where t is the quantile multiplying the error.

This table shows that, for $\sigma \simeq \Delta$, i.e. for $\sigma > \Delta/\sqrt{12}$, the results coincide with
the Gaussian levels (first row), while, for $\sigma \ll \Delta$, they tend to those of the uniform
distribution (last row). In the intermediate cases, all in all, the results do not differ
much from the standard Gaussian levels.

The results calculated with Eqs. (12.43) and (12.44) can be obtained with a few
simulation lines, such as the following, in which $\sigma = 1$ and a systematic range
$\Delta = 3\sigma$ are considered:


> vec <- rnorm(10000) + runif(10000,min=-1.5,max=1.5)
> error <- sqrt(1 + 9/12)
> 1-(length(vec[vec<(-error)])+length(vec[vec>error]))/length(vec)


The obtained results coincide, within the statistical error of the simulation, with 
those of Table 12.2. 
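
The same levels can also be approximated by integrating Eq. (12.44) numerically. The following lines are a sketch that uses the base R function integrate rather than our Stasys routine; the names p_t and cover_level are hypothetical, and the density is written in the standard variable $t$ with $\sigma = 1$, so that only the ratio $\Delta/\sigma$ matters.

```r
# Density of Eq. (12.44) in the standard variable t, with ratio = Delta/sigma
p_t <- function(t, ratio) (pnorm(t + ratio / 2) - pnorm(t - ratio / 2)) / ratio

# Probability content of +/- K sigma_m (cover_level is a hypothetical helper)
cover_level <- function(K, ratio) {
  half <- K * sqrt(1 + ratio^2 / 12)    # K * sigma_m / sigma
  integrate(p_t, -half, half, ratio = ratio)$value
}

ratios <- c(1, 3, 5, 10)
levels_pct <- sapply(ratios, function(r) sapply(1:3, cover_level, ratio = r))
print(round(100 * t(levels_pct), 1))    # rows: Delta/sigma; columns: K = 1, 2, 3
```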

In general, the linear propagation method provides good results even in the
case of measurement products or ratios. In this case Eq. (12.35) gives the linear
propagation of percentage errors, which is often used. Indeed, recalling Eq. (5.68),
if we substitute the true standard deviations with statistical errors, we can write, with
obvious notation:

$$ Z = X_1 X_2 , \quad Z = \frac{X_1}{X_2} , \quad Z = \frac{X_2}{X_1} \quad \Longrightarrow \quad \frac{s_z^2}{z^2} = \frac{s_1^2}{x_1^2} + \frac{s_2^2}{x_2^2} . \qquad (12.46) $$

This property can also be easily derived from Eq. (12.34) and can be immediately
generalized to the case of $n$ variables. Basically, percentage or relative variances
are added together both in the product and in the division. The same property also holds
for the maximum errors of Eq. (12.36):

$$ Z = X_1 X_2 , \quad Z = \frac{X_1}{X_2} , \quad Z = \frac{X_2}{X_1} \quad \Longrightarrow \quad \frac{\Delta_z}{z} = \frac{\Delta_1}{x_1} + \frac{\Delta_2}{x_2} . \qquad (12.47) $$

However, both the quadratic propagation of relative statistical errors (12.46) and the
linear propagation of maximum systematic errors (12.47) should be used cautiously,
because they are valid only if the measures $x_1$ and $x_2$ are independent. For example,
it is easy to see that Eqs. (12.34) and (12.46) give different results when propagating
the error for the ratio $x/(x + y)$. In this case only Eq. (12.34) is correct, because the
presence of the variable $x$ both in the numerator and in the denominator induces a
correlation between them, even if the measures $x$ and $y$ are independent.
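
The difference between the two formulas for the ratio $x/(x + y)$ can be made explicit with a small numerical sketch. All the values below are illustrative assumptions; the simulation, with independent Gaussian $x$ and $y$, is only used as an independent check of Eq. (12.34).

```r
# Error on z = x/(x+y): Eq. (12.34) versus the naive use of Eq. (12.46)
# (all numerical values are illustrative assumptions)
x <- 100; y <- 100; sx <- 5; sy <- 5
z <- x / (x + y)

# Eq. (12.34): partial derivatives of z with respect to x and y
dzdx <- y / (x + y)^2
dzdy <- -x / (x + y)^2
s_correct <- sqrt(dzdx^2 * sx^2 + dzdy^2 * sy^2)

# Naive Eq. (12.46): numerator and denominator treated as independent
s_den   <- sqrt(sx^2 + sy^2)
s_naive <- z * sqrt((sx / x)^2 + (s_den / (x + y))^2)

# Monte Carlo check with independent Gaussian x and y
set.seed(1)
xs <- rnorm(1e5, x, sx)
ys <- rnorm(1e5, y, sy)
s_mc <- sd(xs / (xs + ys))
print(c(correct = s_correct, naive = s_naive, simulated = s_mc))
```

With these numbers the naive propagation overestimates the error by almost a factor of two, while the simulated standard deviation agrees with Eq. (12.34).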


Exercise 12.1
The measurement of the sides of a metal plate with a carpenter's metre with
millimetre marks provided the values:

$$ b = 25.5 \pm 0.5 \ \text{mm} , \qquad h = 34.5 \pm 0.5 \ \text{mm} . $$

Find the value of the plate area $A$.


Answer A 1-mm systematic error (centred about the mean value of the
interval which covers the measured value) has been attributed to the side
measurements. According to Eq. (12.2), in this case one assumes that the
true value is within the observed interval of width $2 \cdot 0.5 = 1$ mm with
CL = 100% and probability given by the uniform distribution. Therefore,
the standard deviation of the measures is given by:

$$ s = \frac{2 \cdot 0.5}{\sqrt{12}} = 0.289 \ \text{mm} . $$

The relative deviations are:

$$ \frac{s_b}{b} = \frac{0.289}{25.5} \simeq 0.011 = 1.1\% , \qquad \frac{s_h}{h} = \frac{0.289}{34.5} \simeq 0.008 = 0.8\% . $$

From Eq. (12.46), the relative error on the area $A = bh$ is:

$$ \frac{s_A}{A} = \sqrt{(0.011)^2 + (0.008)^2} = 0.014 = 1.4\% . $$

Since obviously one has $A = bh = 25.5 \cdot 34.5 = 879.75$ mm$^2$, the final result
is:

$$ A = 880 \pm 0.014 \cdot 880 = 880 \pm 12 \ \text{mm}^2 . $$
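
The result of this exercise can be verified with a short simulation, sketched below under the same assumption of uniform errors within the 1-mm intervals; the variable names are ours.

```r
# Simulation check of Exercise 12.1 (uniform errors within the 1-mm intervals)
set.seed(1)
n <- 1e5
b <- runif(n, 25.5 - 0.5, 25.5 + 0.5)
h <- runif(n, 34.5 - 0.5, 34.5 + 0.5)
A <- b * h                                  # simulated areas

# Relative error from Eq. (12.46), with s = 1/sqrt(12) mm for both sides
s_rel <- sqrt((1 / sqrt(12) / 25.5)^2 + (1 / sqrt(12) / 34.5)^2)
sd_formula <- 25.5 * 34.5 * s_rel
print(c(mean = mean(A), sd_sim = sd(A), sd_formula = sd_formula))
```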


Even in the simple cases that we have just developed in detail, the error 
propagation with analytical methods appears to be quite laborious. However, it 
can be replaced by the simpler direct computer simulation. The most common 
technique is the approximate algorithm described in Sect. 8.10 as the bootstrap 
method. Measured values and their errors are assumed to be the true distribution 
means and standard deviations; random variables are then computer sampled from 
these distributions and are combined as prescribed by the measurement, to obtain 
the simulated histogram of the final quantity (or sets of histograms, in the case of 
complex measurements). 

The shape of the histogram gives an approximate evaluation of the probability 
density of the result, while the measurement errors are directly obtained as limits 
of histogram areas corresponding to the assigned confidence levels. In general, the 
central histogram values coincide or are very close, within the statistical error, to the 
measured values and therefore do not provide new information. In more complicated 
cases, which involve densities (usually multidimensional) not possessing symmetry 
properties such as Eq. (6.10), it is necessary to abandon the approximate bootstrap 
method and to use the rigorous Neyman’s method, which implies the generation 
of simulated data for all possible values of the distribution and the assessment of 
the confidence regions as described in Sect. 8.10. These techniques are discussed in 
detail in [FC98, JLPe00]. 

The following example contains all the elements to understand the Monte Carlo 
techniques applied to the error propagation. Its simplicity is by no means limiting, 
since simulation methods have the great advantage of maintaining basically the 
same level of logical complication, regardless of the complexity of the analysed 
problem. 




Exercise 12.2 

An experimenter throws a stone into a well and with a manual stopwatch 
measures a 3-second falling time. By attributing to this measurement an 
uncertainty of +0.25 s, determine the depth of the well (neglecting the effects 
due to the sound speed). 


Answer The central value of the well depth is calculated with the well-known
law of falling bodies:

$$ l = \frac{1}{2} g t^2 = \frac{1}{2} \cdot 9.81\ \frac{\text{m}}{\text{s}^2} \cdot 9\ \text{s}^2 = 44.14\ \text{m} , \qquad (12.48) $$

where $g = 9.81$ m/s$^2$ is the acceleration of gravity.

If we attribute a uniform distribution to the error inherent to the use of
a manual stopwatch, the uncertainty of ±0.25 s corresponds to a standard
deviation value:

$$ s_t = \frac{0.5}{\sqrt{12}} = 0.144\ \text{s} . $$

Then, the measurement error can be evaluated with Eq. (12.34) for the one-variable case:

$$ s_l = \left|\frac{dl}{dt}\right| s_t = g\, t\, s_t = 9.81 \cdot 3 \cdot 0.144 = 4.24\ \text{m} . $$

According to the analytical error propagation, the value of the well depth
therefore lies in the interval:

$$ l \in (39.90, 48.38) = 44.1 \pm 4.2\ \text{m} . \qquad (12.49) $$


What is the confidence level of this interval? If we denote by $\tau$ the true
value of the fall time, after measuring $t = 3$ s with $\Delta = 0.5$ s, we know that
$2.75 \le \tau \le 3.25$ s with 100% probability. This corresponds to a true
length $\lambda$ within the range:

$$ \tau = \sqrt{2\lambda/g} \quad \Longrightarrow \quad 37.09 \le \lambda \le 51.81 . $$

This interval contains the values of the $\lambda$ parameter that, after the solution of
the integrals (6.7), give the limits $(\lambda_1, \lambda_2)$ of a specific confidence interval. If
we search for the symmetric interval with CL = 68.3%, Eq. (6.7) becomes:

$$ \int_{t_0}^{\infty} p_T(t; \lambda_1)\, dt = 0.158 , \qquad \int_{-\infty}^{t_0} p_T(t; \lambda_2)\, dt = 0.158 , \qquad (12.50) $$

where $t_0 = 3$ s is the measured time. These integrals can be easily solved if
one finds the cumulative of $p_T(t)$. If we consider values $T \sim U(\tau - 0.25, \tau + 0.25)$,
we immediately obtain:

$$ P\{T \le t\} = 2(t - \tau + 0.25) = 2\left(t - \sqrt{2\lambda/g} + 0.25\right) , \qquad \frac{g}{2}(t - 0.25)^2 \le \lambda \le \frac{g}{2}(t + 0.25)^2 . $$

Writing this equation with respect to $\lambda$, we can solve the integrals (12.50) with
the simple condition:

$$ \lambda = \frac{g}{2}\left(t_0 - \frac{\alpha}{2} + 0.25\right)^2 , \qquad (12.51) $$

where the values $\alpha = 0.158$ and $\alpha = 0.841$ give the limits of the interval
$(\lambda_1, \lambda_2)$ with CL = 68.3%. If we denote by $l$ the true value of the length (this
is the usual laboratory notation), we thus obtain the interval:

$$ l \in (39.28, 49.29) = 44.1^{+5.2}_{-4.9}\ \text{m} , \quad \text{CL} = 68.3\% , \qquad (12.52) $$


which is significantly larger than the approximate interval (12.49). 

It is also useful to compare this result with that obtained from the 
simulation. From the bootstrap method, the fall times are randomly generated 
from the uniform density with mean equal to the measured value. The 
histogram with N = 20000 simulated measured lengths (evaluated with 
Eq. (12.48)) is shown in Fig. 12.7. 

In this simple case, the simulation would not be necessary, because
it is straightforward to derive the cumulative of the length from the cumulative of the time:

$$ F_L(l) \equiv P\{L \le l\} = P\left\{\tfrac{1}{2} g T^2 \le l\right\} = P\left\{T \le \sqrt{2l/g}\right\} = 2\left(\sqrt{2l/g} - \tau + 0.25\right) , \qquad (12.53) $$

$$ \frac{1}{2} g (\tau - 0.25)^2 \le l \le \frac{1}{2} g (\tau + 0.25)^2 , $$

and then to get, from its derivative, the p.d.f. of $l$:

$$ p_L(l; \tau) = \frac{dF_L}{dl} = \sqrt{\frac{2}{g\,l}} , \qquad \frac{1}{2} g (\tau - 0.25)^2 \le l \le \frac{1}{2} g (\tau + 0.25)^2 . \qquad (12.54) $$

The form of this function depends on $\tau$ through the limits of the definition
range.

Let us now proceed to analyse the simulation results. Since the histogram
contains 20,000 events, from Eq. (6.50) and from the data of the figure, the
estimation interval of the mean is obtained:

$$ \mu \in 44.19 \pm \frac{4.22}{\sqrt{20000}} = 44.19 \pm 0.03 , $$

which is a result in agreement with the measured value of 44.14 metres. Given
the density asymmetry of Fig. 12.7 and the nonlinearity of the physical law
used, it is reasonable to expect a small deviation between the mean value and
the measured one.

If we analyse the areas of the histogram centred around the measured value,
we find that, as shown in Fig. 12.7, 68.3% of the depth values are within:

$$ l \in [39.24, 49.24] \simeq 44.1^{+5.1}_{-4.9}\ \text{m} \quad (\text{CL} = 0.68) , \qquad (12.55) $$

a result coincident with the correct one (12.52). Figure 12.7 shows that the
confidence levels are not Gaussian at all. We could say in this case that
the measurement result is represented more precisely by Fig. 12.7 than by
Eq. (12.55).


Fig. 12.7 Computer simulation of N = 20000 lengths $l = (1/2) g t^2$, where
$g = 9.81$ m/s$^2$ and $t$ is a measured time of 3 s, distributed as the uniform
density with $\Delta = 0.5$ s. The area between the two dashed lines, to the left and to the
right of the measured value (at 44.14 − 4.9 and 44.14 + 5.1), contains 68% of the
histogrammed events (lengths). The density shape (12.54) is well approximated
by the best-fit straight line $f(l) = 343 - 2.5\,l$ shown as a solid line. By dividing this
function by the histogram bin width $\Delta l = 0.17$, the density $N p(l) = 2018 - 14.70\,l$ is
obtained, which allows the calculation of the areas under the curve (confidence levels).
[Histogram, n = 20000; horizontal axis: distance (m), from 40 to 52.5]
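
A simulation of the kind used in Exercise 12.2 can be sketched in a few lines of R. This is not the code used to produce the figure: the variable names are ours, and the 68.3% interval is taken here from the 0.158 and 0.841 quantiles of the simulated lengths, which is one possible choice. The last quantity is the fraction of simulated depths falling inside the analytical interval (12.49).

```r
# Sketch of the simulation of Exercise 12.2 (assumed setup, names are ours)
set.seed(1)
g <- 9.81
t_sim <- runif(20000, 2.75, 3.25)     # fall times, uniform around 3 s
l_sim <- 0.5 * g * t_sim^2            # simulated depths, Eq. (12.48)

central <- quantile(l_sim, c(0.158, 0.841))      # central 68.3% interval
in_lin  <- mean(l_sim > 39.90 & l_sim < 48.38)   # content of interval (12.49)
print(c(mean = mean(l_sim), sd = sd(l_sim)))
print(central)
print(in_lin)
```

Under these assumptions the central interval is close to (12.52), whereas the interval (12.49) contains less than 60% of the simulated depths, and not the 68.3% that a Gaussian density would give.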




The long discussion of this section can therefore be summarized in the following 
points: 


• To evaluate statistical errors, it is necessary to estimate the sample variances and
to apply Eq. (12.35). If errors are Gaussian, the results often follow the Gaussian
density (since a linear combination of the effects is usually assumed), and the 3σ
law (3.35) still holds.

• To evaluate systematic errors, Eq. (12.39) should be applied, and the resulting
standard deviation often follows approximately Gaussian confidence levels.
Equation (12.37) instead defines a maximum error, which must not be combined
with other quantities representing estimated standard deviations.

• The combination of systematic and statistical errors must always be done in
quadrature, using Eq. (12.35), where the variance of the systematic effects must
be calculated with Eq. (12.39) (do not forget the 1/12 or 1/24 factors). The resulting
standard deviation does not follow the 3σ law, because the corresponding
density is not Gaussian (see Eq. (12.44)). However, Gaussian confidence levels
are often assumed in practice. This assumption is generally all the more true the
higher the number of combined errors is.

• The analytical procedure can lead to considerable inaccuracies in the case of
large, correlated errors or of errors to be combined nonlinearly. In these cases
simulation methods have to be used, since they usually allow us to solve any error
propagation problem in a complete and satisfactory way. Thanks to simulation
methods, in recent years the results provided by experimental physics have
remarkably improved their precision, accuracy and reliability.


12.10 Measurement Types 


The scheme of Fig. 12.1 suggests classifying measurements as represented in
Fig. 12.8. This is our personal notation; you will not find it in other texts. The
following examples clarify its meaning:


– $M(0, 0, \Delta)$ = measurement of a constant physical quantity with systematic
errors

– $M(0, \sigma, 0)$ = measurement of a constant physical quantity with statistical errors

– $M(f, \sigma, \Delta)$ = measurement of a variable physical quantity in the presence of
both statistical and systematic errors


Since each of the three symbols of the notation can assume two values, in total there
are 2 × 2 × 2 = 8 different types of measurements. However, the case $M(0, 0, 0)$,
which refers to the error-free measurement of a fixed quantity, represents an ideal
case without interest in this context. Therefore, in practice seven different types of
measurement must be considered, which will be detailed in the next sections.




[Diagram: M(quantity, statistical errors, systematic errors), where the first symbol is 0 (constant) or f (variable), the second is 0 (absent) or σ (present), and the third is 0 (absent) or Δ (present)]


Fig. 12.8 Classification of the possible types of measurement


12.11 M(0, 0, Δ) Measurements


In this case, a constant quantity is measured, and the statistical errors are not present,
because they are either totally absent or much smaller than the systematic error of
width $\Delta$.

This measure has no fluctuations, and all repeated measurements provide the
same value $x$. The result is usually presented in the form (12.2):

$$ x \pm \frac{\Delta(x)}{2} \quad (\text{CL} = 100\%) . \qquad (12.56) $$

The error here has the meaning of maximum error, and the interval covers the true
value with a 100% probability, that is, with certainty.

Often, based on the arguments of Sect. 12.4, systematic errors are associated with
a uniform density. In this case the variance of the measure is given by (3.82):

$$ s^2(x) = \frac{\Delta^2(x)}{12} , \qquad (12.57) $$

whereas the standard deviation is usually written as:

$$ s(x) = \frac{\Delta(x)}{2} \frac{1}{\sqrt{3}} = \frac{U}{\sqrt{3}} , \qquad (12.58) $$

where $U = \Delta/2$ is sometimes known as the measurement uncertainty.

The error deriving from the linear combination of several measures of this type
must be propagated with Eqs. (12.38) and (12.39), and the confidence intervals tend
to rapidly become Gaussian as the number of measures increases.




Exercise 12.3
The measure of an electrical resistance with a digital multimeter provided the
value:

$$ R = 235.4\ \Omega . $$

The “table of accuracy” of the instrument booklet gives an accuracy of:

± (0.1% rdg + 1 dgt)

for resistance measurements. Find the result of the measurement.


Answer This is a measurement where only systematic errors are present, as
the statistical fluctuations due to time variations of the resistance value are
below the multimeter sensitivity.

The instrument accuracy is 0.1% of the reading (rdg) with the addition of
one unit of the last right digit reported on the display (dgt), which in our case
is 0.1 Ω. Therefore, we have:

$$ \Delta(R) = \pm(235.4 \cdot 0.001 + 0.1) = \pm(0.2 + 0.1) = \pm 0.3\ \Omega . $$

The result of the measurement is then:

$$ R = 235.4 \pm 0.3\ \Omega , $$

with 100% confidence level.
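
The arithmetic of this kind of accuracy specification is easily automated. The short R function below is only an illustrative sketch: its name and arguments are our own and do not come from any instrument library. The last line applies Eq. (12.58), under the assumption of a uniform density within the maximum error.

```r
# Maximum error from an accuracy specification "+/- (pct% rdg + n dgt)"
# (dmm_max_error is a hypothetical helper, not a book routine)
dmm_max_error <- function(reading, pct_rdg, n_dgt, resolution) {
  reading * pct_rdg / 100 + n_dgt * resolution
}

R <- 235.4
delta_R <- dmm_max_error(R, pct_rdg = 0.1, n_dgt = 1, resolution = 0.1)
print(round(delta_R, 1))          # maximum error in ohm
print(delta_R / sqrt(3))          # standard deviation under a uniform density
```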


12.12 M(0, σ, 0) Measurements


In this case, a constant physical quantity is still measured, but with a sensitivity
interval of the apparatus much smaller than the statistical errors: $s \gg \Delta$.

Repeated measurements give different values, which generally, but not always,
are distributed according to the Gaussian density. A sample of $N$ measurements is
thus obtained, from which the average $m$ and the standard deviation $s$ are calculated.
Since, in the absence of systematic errors, it is assumed that the true value of the
physical quantity coincides with the mean of the distribution of the measures (true
mean), based on Eq. (6.50) the result of the measurement must be presented in the
form:

$$ x = m \pm \frac{s}{\sqrt{N}} \quad (\text{CL} \simeq 68\% \ \text{if} \ N > 10) , \qquad (12.59) $$

which must be associated with a Student’s confidence interval (for Gaussian data)
or with a Gaussian one if $N > 100$ (for any data). The error of this type of measurement
then tends to zero as $N$ increases.

The starting hypothesis, however, is to have a perfect instrument, that is, with
$\Delta = 0$. In practice, the results are presented in the form (12.59) when the sample
size is not sufficient to obtain a precision that can compete with the accuracy of the
apparatus, i.e. when:

$$ \frac{s}{\sqrt{N}} \gg \Delta . $$
VN 


We emphasize an important point: the interval (12.59) is different from the interval

$$ m \pm s , $$

which is an estimate of the interval $\mu \pm \sigma$, giving the probability of obtaining a
single measurement. Indeed, according to Definition 12.6, the standard deviation
$s$ is an estimate of the precision of a single measurement (measurement error),
while the quantity $s/\sqrt{N}$ represents the precision of the global measurement. From
Eq. (12.59) it also follows that, if two measurements $M_1$ and $M_2$ have different errors,
it is possible to obtain the same final precision from both of them if $N_1$ and $N_2$ obey
the relation:

$$ \frac{N_1}{N_2} = \frac{s_1^2}{s_2^2} . $$

If, for example, $s_1 > s_2$, then $N_1 > N_2$, that is, the number of measures of the
experiment with the larger error must be larger (see again Fig. 12.3).

If the $x_i$ measures to be averaged come from different experiments, they will have
different precisions $s_i$, and then the weighted mean formula (10.70) must be used
with $m_i = x_i$, $n_i = 1$. In practice, a likelihood function must be considered here as
the product of $N$ Gaussian measures, which provided the results $x_i \pm s_i$. With the
approximation $s_i \simeq \sigma_i$, we can write the probability of obtaining the observed result
according to the likelihood (10.49):

$$ L(\mu; x) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left[-\frac{1}{2}\frac{(x_i - \mu)^2}{\sigma_i^2}\right] . \qquad (12.60) $$

The maximization of this likelihood is equivalent to the minimization of the negative
log-likelihood (10.6), which in this case coincides with the least squares
minimization:

$$ -\ln L(\mu) \equiv \mathcal{L}(\mu) = -\sum_{i=1}^{N} \ln\frac{1}{\sqrt{2\pi}\,\sigma_i} + \frac{1}{2}\sum_{i=1}^{N} \frac{(x_i - \mu)^2}{\sigma_i^2} . \qquad (12.61) $$

To find the point of minimum $\hat\mu$, we set $d\mathcal{L}(\mu)/d\mu = d\chi^2/d\mu = 0$, and, with the
same procedure of Sect. 10.8, we obtain the result:

$$ \mu \in \hat\mu \pm \sqrt{\mathrm{Var}[\hat\mu]} = \frac{\sum_{i=1}^{N} x_i p_i}{\sum_{i=1}^{N} p_i} \pm \frac{1}{\sqrt{\sum_{i=1}^{N} p_i}} , \qquad p_i = \frac{1}{s_i^2} . \qquad (12.62) $$

It should be remembered that this formula is strictly valid only for Gaussian
measurements, where the interval (12.62) has a 68% coverage. However, even if
the data were not Gaussian, for $N > 10$ the Central Limit Theorem holds, as in the
case of Eq. (12.59).


Exercise 12.4
A series of measurements of the speed of light taken by different groups gave
the following means:

$c_1 = (2.99 \pm 0.03) \times 10^{10}$ cm/s
$c_2 = (2.996 \pm 0.001) \times 10^{10}$ cm/s
$c_3 = (2.991 \pm 0.005) \times 10^{10}$ cm/s
$c_4 = (2.97 \pm 0.02) \times 10^{10}$ cm/s
$c_5 = (2.973 \pm 0.002) \times 10^{10}$ cm/s

From these data, determine the best estimate of the light speed.


Answer Since the fifth datum is incompatible with the first three, the most
likely hypothesis is that the experimenter made a mistake and that only the
first four data are correct.

The first four measures have weights:

$$ p_1 = 1111 , \qquad p_2 = 10^6 , \qquad p_3 = 40000 , \qquad p_4 = 2500 , \qquad \sum_{i=1}^{4} p_i = 1043611 . $$

Note the higher values of the weights associated with the more precise data.
Applying Eq. (12.62) we obtain:

$$ c = (2.99574 \pm 0.00098) \times 10^{10}\ \text{cm/s} . \qquad (12.63) $$

Here it is worth noting a general fact: the error of the weighted average is
always lower than the smallest error in the original data. The result can also
be presented in the rounded form:

$$ c = (2.996 \pm 0.001) \times 10^{10}\ \text{cm/s} , $$

which is identical to the result $c_2$ of the second measurement.

If we had not used the weighted average, we could have used (wrongly)
the unweighted formula (12.59). Since the standard deviation of the first four
data, based on the second of Eqs. (6.54), is:

$$ s(c) = \sqrt{\frac{\sum_{i=1}^{4} (c_i - \bar c)^2}{3}} = 0.01147 \times 10^{10}\ \text{cm/s} , $$

we would have obtained:

$$ c = 2.98675 \pm \frac{s(c)}{\sqrt{4}} = (2.9867 \pm 0.0057) \times 10^{10}\ \text{cm/s} , $$

which is a result quite different from the correct one (12.63).
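
The numbers of this exercise can be reproduced with a few lines of R, sketched here with variable names of our own choice; the first four data are entered in units of $10^{10}$ cm/s.

```r
# Weighted and unweighted means of the first four data of Exercise 12.4
cc <- c(2.99, 2.996, 2.991, 2.97)     # means, in 10^10 cm/s
ss <- c(0.03, 0.001, 0.005, 0.02)     # their errors
p  <- 1 / ss^2                        # weights p_i = 1/s_i^2

c_w   <- sum(cc * p) / sum(p)         # weighted mean, Eq. (12.62)
err_w <- 1 / sqrt(sum(p))             # its error
c_u   <- mean(cc)                     # unweighted mean, Eq. (12.59)
err_u <- sd(cc) / sqrt(length(cc))
print(round(c(c_w, err_w, c_u, err_u), 5))
```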


12.13 M(0, σ, Δ) Measurements


Here a constant physical quantity is measured with an apparatus where both 
statistical and systematic errors are relevant. 

When $\Delta$ is the sensitivity interval, repeated measurements provide different
values, and the data sample is typically displayed in histogram form. For obvious
reasons, it makes no sense to choose a histogram bin width smaller than the sensitivity
interval $\Delta$. Histograms can have many or few bins, depending on whether $s \gg \Delta$,
$s \simeq \Delta$ or $s \ll \Delta$, where $s$ is the standard deviation due to the random
effects (see Fig. 12.9).

Fig. 12.9 Different types of histograms of bin width $\Delta$ when data have a statistical error $s$
(panels: $s \gg \Delta$, $s \simeq \Delta$, $s \ll \Delta$)

This class of measurements also includes those of the type $M(0, \sigma, 0)$
when the data are displayed in a histogram with bin width $\Delta$ or measured with
an instrument of equal sensitivity. This case corresponds to the systematic error
of the first type discussed in Sect. 12.6, which requires only the application of
Sheppard’s correction (12.7) to the variance. However, it is useful to note that
there is no complete agreement among researchers on whether to apply Sheppard’s
correction (indeed, we believe that many of them are not even aware of its existence ...). In
fact, the correction is valid only for Gaussian or bell-shaped distributions like that of
Fig. 12.4, while for other distributions it generally tends to underestimate
the variance.
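A small simulation may help to see the size of the effect. The R sketch below assumes that Sheppard’s correction (12.7) has the usual form $s^2 - \Delta^2/12$; it generates Gaussian data, replaces each value by the centre of its bin of width $\Delta$ and compares the variances. The seed, the sample size and the parameter values are arbitrary choices made for the illustration.

```r
set.seed(123)
sigma_true <- 1.0   # true standard deviation (arbitrary)
Delta      <- 1.0   # bin width / sensitivity interval (arbitrary)
x <- rnorm(1e5, mean = 10, sd = sigma_true)

# Replace each value by the centre of its bin of width Delta
x_binned <- Delta * round(x / Delta)

var_binned    <- var(x_binned)              # inflated by the binning
var_corrected <- var_binned - Delta^2 / 12  # Sheppard's correction (assumed form)

cat("binned variance:   ", var_binned, "\n")
cat("corrected variance:", var_corrected, "\n")
```

For this bell-shaped case the corrected variance should come back close to the true value $\sigma^2 = 1$; repeating the exercise with a strongly non-Gaussian density is a simple way to explore the limitation mentioned above.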

Following the classification of Sect. 12.6, let us now consider the systematic 
effects of the second type, named offset or additive errors. 

In the recent scientific literature, measurements of the type $M(0, \sigma, \Delta)$ with
offset systematic errors are often reported in the form:

$$
x = m \pm \frac{s}{\sqrt{N}} \ (\text{stat}) \pm \frac{\Delta}{2} \ (\text{syst}) , \tag{12.64}
$$

where $s$ is the standard deviation of the sample of $N$ data and $\Delta$ is the uncertainty
range of the systematic effect correction. The mean $m$ is calculated from the
observed mean $m'$ corrected for the systematic effects, $m = m' - c$, where $c$
is the best estimate of the offset value. The standard deviation $s$ is estimated
directly from the sample or after Sheppard’s correction (12.7). Sometimes more
sophisticated methods are used to find $s$(stat) and $\Delta$(syst), such as the simulation
methods described in Sect. 12.9.

In Eq. (12.64) the confidence level remains undetermined, because the systematic
errors are reported with CL = 100%, whereas the statistical ones are usually given
with CL $\simeq$ 68% (if they are Gaussian). The way of combining the two errors is
somewhat arbitrary but is generally the one suggested by Eq. (12.11), considering
the term $\sigma^2$ as the variance of the mean of the observed data:

$$
x = m \pm \sigma_m = m \pm \sqrt{\frac{s^2}{N} + \frac{\Delta^2}{12}} . \tag{12.65}
$$

In this case the confidence intervals are approximately Gaussian and are given by
Eq. (12.44) and in Table 12.2.
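As a numerical illustration of this quadrature combination (statistical variance of the mean, $s^2/N$, plus the variance $\Delta^2/12$ of a uniform systematic range), consider the following R sketch. All input numbers are invented for the example and do not come from the text.

```r
# Hypothetical inputs (not taken from the text)
m     <- 10.25   # mean, already corrected for the offset
s     <- 0.40    # sample standard deviation
N     <- 50      # number of measurements
Delta <- 0.20    # full width of the systematic uncertainty range

stat_err  <- s / sqrt(N)        # statistical error of the mean
syst_err  <- Delta / sqrt(12)   # standard deviation of a uniform law of width Delta
total_err <- sqrt(stat_err^2 + syst_err^2)

cat(sprintf("x = %.3f +/- %.3f (stat: %.3f, syst: %.3f)\n",
            m, total_err, stat_err, syst_err))
```

With these invented numbers the two contributions are of similar size, so neither can be neglected; when one of them dominates, the total error is practically equal to the larger term.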

The third case considered in Sect. 12.6 is that related to multiplicative systematic
effects, due to scaling or normalization factors. If $c$ is the correction factor of the
effect, all data must be divided by this factor. What remains is an uncertainty to
which a uniform distribution of amplitude $\pm\Delta/2$ and unit mean can be
attributed. After this correction, and using Eq. (12.19), the result of the measurement
can be written as:

$$
x = m \pm \sqrt{\frac{s^2}{N} + m^2 \frac{\Delta^2}{12} + \frac{s^2}{N} \frac{\Delta^2}{12}} , \tag{12.66}
$$

where $s^2$ is the variance of the data sample after the systematic effect correction.
The last term under the square root is usually negligible.

As you can see, the procedure that combines statistical and systematic errors
is neither unique nor free from ambiguities. However, there are cases where the
method does not present difficulties. A good example is the discovery of the nuclear
particle $Z^0$, which earned some of its discoverers the Nobel Prize in Physics in 1984.
The two experiments (called UA1 and UA2), which simultaneously discovered the
particle at the European CERN laboratory in Geneva, measured the following mass
values, expressed in giga-electronvolts (GeV)¹ [Col83a, Col83b]:

$$
M_Z = 95.2 \pm 2.5 \ (\text{stat}) \pm 2.8 \ (\text{syst}) \ \text{GeV} \quad (\text{UA1}) ,
$$

$$
M_Z = 91.9 \pm 1.3 \ (\text{stat}) \pm 1.4 \ (\text{syst}) \ \text{GeV} \quad (\text{UA2}) ,
$$

where, in both cases, the systematic error derives from the uncertainty in the
absolute energy calibration of the apparatus. On the other hand, the theory
predicted the value

$$
92.3 \pm 0.7 \ \text{GeV} ,
$$

where the error is due to the approximations used in the calculations.

In this case, the excellent agreement between theory and experiments is evident,
regardless of the specific techniques of data analysis and error handling that
might be used.

¹ GeV is an energy unit used in particle physics and is equal to $1.6 \cdot 10^{-10}$ J.


12.14 $M(f, 0, 0)$ Measurements


Here we refer to the case of a random variable measured without any error.




The purpose of the measurement is the determination of the distribution or 
statistical law that determines the considered physical phenomenon. Sometimes it 
may be sufficient to determine mean and dispersion of this distribution, while in 
other cases, it is essential to know its precise functional form. For the latter case, 
think about the Maxwell and Boltzmann densities, which are the basis of statistical 
mechanics. 

Basically, all the stochastic phenomena discussed in the previous chapters belong
to this class of measurements and observations, and the methods to determine
means, variances and functional forms are precisely those that have been
extensively described in this text. As a significant example, just remember the
10-coin experiment, which has been analysed in Exercise 10.7.

In physical sciences, this type of measurement includes all counting experiments, 
for which, as discussed in Sect. 3.7, the Poisson distribution plays a fundamental 
role. 

In these cases, one often first counts $N_{s+b}$ events from a source within a
time period of length $t_s$; then, after removing the source, $N_b$ background events are,
in general, recorded for a longer time $t_b$. The signal/background ratio can be estimated
as the number $n_\sigma$ of standard deviations of the signal over the background (i.e. as the
standard Gaussian variable), normalizing the background counts to the time interval
$t_s$. In this step, attention must be paid to the calculation of the background standard
deviation, which is $\sigma[N_b] = \sqrt{N_b} \, t_s/t_b$. This means that the Poissonian error of
the measured counts must be evaluated before the multiplication by the constant
$t_s/t_b$, according to the error propagation law. In fact, an algebraically manipulated
Poissonian variable no longer follows the original distribution.

Therefore, one obtains:

$$
n_\sigma = \frac{N_{s+b} - N_b \, t_s/t_b}{\sqrt{N_{s+b} + N_b \, t_s^2/t_b^2}} . \tag{12.67}
$$

In the search for new phenomena, physicists speak about strong evidence when
$n_\sigma \simeq 3$, whereas a discovery is claimed when $n_\sigma > 5$. See Problem 12.3 for an
application of this formula.

We will now discuss a typical nuclear physics counting experiment.
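Before moving to the exercise, Eq. (12.67) can be sketched in R as follows. The function name `n_sigma` and the counts used in the example call are hypothetical; the point is only to show that the Poissonian variances are taken on the raw counts and then scaled by $(t_s/t_b)^2$.

```r
# Significance of a signal over background, following Eq. (12.67)
n_sigma <- function(N_sb, N_b, t_s, t_b) {
  scale  <- t_s / t_b                     # normalization of the background to t_s
  signal <- N_sb - N_b * scale            # background-subtracted counts
  sigma  <- sqrt(N_sb + N_b * scale^2)    # Poisson variances, scaled after evaluation
  signal / sigma
}

# Hypothetical run: 150 events in 100 s with the source, 400 in 500 s without it
result <- n_sigma(N_sb = 150, N_b = 400, t_s = 100, t_b = 500)
print(result)
```

Measuring the background for a longer time reduces its contribution to the denominator, which is one reason why $t_b$ is usually chosen larger than $t_s$.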


Exercise 12.5

In 1930 the physicist L.F. Curtiss performed an experiment to determine the
statistical law describing the particle emission in the decay of a radioactive
nucleus. Using a Geiger counter, he recorded the number of $\alpha$ particles
emitted by a thin Polonium film. During the experiment, the number of
particles counted in 3455 time intervals of equal length (a few minutes) was
recorded. If we define:

$x$ = number of emitted particles (Geiger counts)

$n_x$ = number of equal time intervals with $x$ counts

The results of the experiment, reported in [Eva55], can be summarized as:

$$
\begin{array}{c|rrrrrrrr}
x   & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\
n_x & 8 & 59 & 177 & 311 & 492 & 528 & 601 & 467 \\
\hline
x   & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 \\
n_x & 331 & 220 & 121 & 85 & 24 & 22 & 6 & 3
\end{array}
\tag{12.68}
$$

From this table it results, for instance, that there were 8 intervals without
counts, 59 intervals with one count only, 177 intervals with two counts and so
on, up to a limit of 3 intervals with 15 counts.

Perform the complete analysis of the experiment.


Answer The Polonium half-life is about 4 months, so the emission intensity of
the source can be considered constant during the experimental data collection.

The Geiger counter is a gas tube under electric voltage, in which a
discharge occurs when an ionizing $\alpha$ particle crosses the detector. The tube
recharging time, needed to have a voltage value between electrodes sufficient
to produce a new discharge, is of the order of one thousandth of a second.
Since the experiment recorded fewer than 20 counts in a few minutes, the bias
in the counts due to $\alpha$ particles entering the detector during the recharge (dead)
time is absolutely negligible. Therefore, we are in the case of an experiment
without errors in counting the number of emitted particles, which we previously
labelled as $M(f, 0, 0)$.

Then, under the assumptions that the intensity of the Polonium source
remains constant, and that all the nuclei of the sample are independent $\alpha$
particle emitters with the same constant probability over time (as in any
nuclear model of radioactive decay), the experimental counts must be Poisson
distributed. So let us verify this assumption.

We first note that, having grouped the data as in Eq. (12.68), the spectrum
of the random variable under examination is given by the number $x$ of
recorded counts, while the event frequencies are given by the number of
time intervals with a given number $x$ of counts. The spectrum frequencies
are therefore given by:

$$
f_x = \frac{n_x}{N} .
$$


Using Eqs. (2.53), (2.55), and (2.58), we can then calculate from the sample
the mean, variance and fourth-order moment (you can also open and use the R
console on your computer):

$$
m = \sum_{x=0}^{15} x f_x = 5.877 ,
$$

$$
s^2 = \frac{N}{N-1} \sum_x (x - m)^2 f_x = 5.859 ,
$$

$$
D_4 = \frac{N}{N-1} \sum_x (x - m)^4 f_x = 107.07 .
$$

The equality between mean and variance is evident, in agreement with the
property of the Poisson distribution given by Eq. (3.16). In fact, the statistical
estimate of the mean is given by Eq. (6.50):

$$
\mu \in m \pm \frac{s}{\sqrt{N}} = 5.877 \pm 0.041 \simeq 5.87 \pm 0.04 ,
$$

whereas that on the variance, from Eq. (6.63), is:

$$
\sigma^2 \in s^2 \pm \sqrt{\frac{D_4 - s^4}{N}} = 5.859 \pm 0.145 \simeq 5.86 \pm 0.14 .
$$

For this calculation we used the general formula, since the mean value < 10
does not allow us, strictly speaking, to apply Eq. (6.79), only valid for Gaussian
variables (see Table 6.3). In this case, however, also Eqs. (6.68) (or (6.79))
give a statistically identical result:

$$
\sigma^2 \in s^2 \pm s^2 \sqrt{\frac{2}{N-1}} = 5.859 \pm 0.141 \simeq 5.86 \pm 0.14 .
$$

Confidence intervals for mean and variance can be associated with Gaussian
probability levels, because $N = 3455 \gg 100$. Then, we can proceed to the
difference test of Eq. (7.6) (neglecting the covariance between $M$ and $S^2$):

$$
\frac{|m - s^2|}{\sqrt{s^2(m) + s^2(s^2)}} = \frac{|5.877 - 5.859|}{\sqrt{0.041^2 + 0.145^2}} = 0.12 ,
$$


which gives a result compatible with $\mu = \sigma^2$ within 0.12 error (standard
deviation) units.
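These numbers can be reproduced with a few lines of R. The sketch below takes the counts from table (12.68) and assumes the $N/(N-1)$ normalization for both the variance and the fourth-order moment, a choice that reproduces the quoted values; the variable names are illustrative.

```r
x   <- 0:15
n_x <- c(8, 59, 177, 311, 492, 528, 601, 467,
         331, 220, 121, 85, 24, 22, 6, 3)
N   <- sum(n_x)          # total number of time intervals
f_x <- n_x / N           # observed frequencies

m  <- sum(x * f_x)                          # sample mean
s2 <- N / (N - 1) * sum((x - m)^2 * f_x)    # sample variance
D4 <- N / (N - 1) * sum((x - m)^4 * f_x)    # fourth-order moment (assumed normalization)

err_m  <- sqrt(s2 / N)                # error of the mean
err_s2 <- sqrt((D4 - s2^2) / N)       # error of the variance (general formula)

# Difference test between mean and variance, neglecting their covariance
t_val <- abs(m - s2) / sqrt(err_m^2 + err_s2^2)

print(round(c(m = m, s2 = s2, D4 = D4,
              err_m = err_m, err_s2 = err_s2, t = t_val), 3))
```

The test value obtained with unrounded intermediate results may differ slightly in the last digit from the one computed above with rounded inputs.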

From this preliminary analysis, we have obtained a first important verification
in favour of the Poisson distribution. We can go on and perform the
$\chi^2$ minimization with respect to $\mu$. The function to be minimized is given by
Eq. (7.36):

$$
\chi^2 = \sum_x \frac{\left[ n_x - N p(x; \mu) \right]^2}{n_x} ,
$$

where:

$$
p(x; \mu) = \frac{\mu^x}{x!} \, e^{-\mu}
$$

is the Poisson density with an unknown mean $\mu$ to be determined.
With our routine Nlinfit (see Problem 12.16), we have obtained the
following values for the mean and the minimal $\chi^2$:

$$
\mu = 5.866 \pm 0.04 , \qquad \chi^2 = 18.64 .
$$

To perform the $\chi^2$ test, we first need to determine the number of degrees of
freedom. The histogram has 16 channels and was obtained with a predetermined
total number of $N = 3455$ events, since this is nothing more than the
number of counting trials performed by Curtiss. This decreases the degrees of
freedom by one unit (if in doubt, re-read Sects. 6.14 and 7.5). Furthermore,
the parameter $\mu$ has been estimated from the data, which decreases the
number of degrees of freedom by another unit; this number is therefore equal to
$\nu = 16 - 2 = 14$. The reduced chi-square $\chi^2_R$ has the value:

$$
\chi^2_R(14) = \frac{18.64}{14} = 1.33 .
$$

From Table E.3 we obtain a p-value of about 15%, in good agreement with
the Poisson distribution.
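A comparable fit can be sketched with base R alone. The code below is not the Nlinfit routine: it assumes a chi-square with the observed counts $n_x$ in the denominator and uses the one-dimensional minimizer `optimize` together with `pchisq` for the tail probability. It should give a mean and a minimum close to those quoted above, although small differences in the last digits and in the p-value read from a coarse table are to be expected.

```r
x   <- 0:15
n_x <- c(8, 59, 177, 311, 492, 528, 601, 467,
         331, 220, 121, 85, 24, 22, 6, 3)
N   <- sum(n_x)

# Chi-square with the observed counts in the denominator (assumed form)
chi2 <- function(mu) sum((n_x - N * dpois(x, mu))^2 / n_x)

fit      <- optimize(chi2, interval = c(4, 8))   # one-dimensional minimization
mu_hat   <- fit$minimum
chi2_min <- fit$objective

nu      <- length(x) - 2     # 16 channels, fixed total N, one fitted parameter
p_value <- pchisq(chi2_min, df = nu, lower.tail = FALSE)

print(round(c(mu = mu_hat, chi2 = chi2_min,
              reduced = chi2_min / nu, p = p_value), 3))
```

Because the observed counts appear in the denominator, the fitted mean tends to lie slightly below the sample mean, as is also the case for the value quoted in the text.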

The final data can be summarized in a table or, more briefly, in a graph,
reporting, for each value of $x$, the value of $n_x$, its statistical error (the error
bar) and Poisson’s law prediction. For example, Eqs. (3.14) and (6.106) give
for $x = 5$:

$$
n_x \pm \sqrt{n_x} = 528 \pm 23 , \qquad
N p(5; 5.866) = 3455 \, \frac{5.866^5}{5!} \, e^{-5.866} = 568 .
$$

The fit result is shown in Fig. 12.10, where an excellent overall agreement
between data and theory can be noticed.

The mean value found by the experiment, once normalized to the unit of
time and divided by the number of radioactive nuclei present in the source,
gives the Polonium $\alpha$ decay constant (with its statistical error). This constant,
usually referred to as $\lambda$, is a fundamental intrinsic physical characteristic of
the Polonium nucleus.


Fig. 12.10 Comparison between data (points with error bars) and estimated Poisson density (solid
line histogram) in the case of the Polonium radioactive decay experiment (horizontal axis: counts;
vertical axis: number of counting intervals)




Exercise 12.6 

Two counting experiments of the same phenomenon recorded 92 events in 
100 s and 1025 events in 1000 s, respectively. Evaluate the weighted average 
of the two results. 


Answer It is not possible to directly evaluate the weighted average of the raw
data, because they refer to different counting times. However, since the counts
come from the same source and the second counting time is ten times longer, the
data normalization can be performed as follows:

$$
m_1 \in 92 \pm \sqrt{92} = 92 \pm 9.6 ,
$$

$$
m_2 \in \frac{1}{10} \, 1025 \pm \frac{1}{10} \sqrt{1025} = 102.5 \pm 3.2 ,
$$

where Eq. (6.106) has been used and the second measurement has been normalized
(with its error) to the first one. Note that the second measurement gives a more
precise result, because, as explained in Sect. 6.14, for Poissonian events the
relative statistical error goes as $\sqrt{n}/n$ and then decreases when the sample
size increases.

Since the order of magnitude of the recorded counts is a hundred, these
Poisson variables (having a mean > 10) can be considered Gaussian. We can
therefore apply the weighted average formula.

The weights of $m_1$ and $m_2$ are given by:

$$
p_1 = \frac{1}{9.6^2} = 0.011 , \qquad p_2 = \frac{1}{3.2^2} = 0.098 ,
$$

and the estimate of the true mean is:

$$
\mu \in \frac{92 \cdot 0.011 + 102.5 \cdot 0.098}{0.011 + 0.098} \pm \frac{1}{\sqrt{0.011 + 0.098}}
= 101.4 \pm 3.0 \ \text{counts in 100 seconds} .
$$

To conclude, it is worth noting two very general facts: the final result is closer
to the most precise measurement (the second, which has a weight about ten times
greater than the first), and the inclusion of the first measurement, even if much
less precise, still reduces the error, even if only slightly.

Note also that, if one merges the counts of the two experiments, a total
rate of $1117/1100 \pm \sqrt{1117}/1100 = 1.01 \pm 0.03$ counts/s is obtained, in
agreement with the previous result.
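The whole calculation of this exercise fits in a short R sketch. The variable names are illustrative; the only inputs are the counts and the counting times given in the exercise, and the Poissonian errors are evaluated on the raw counts before the normalization to 100 s.

```r
counts <- c(92, 1025)     # recorded events
times  <- c(100, 1000)    # counting times in seconds
t_ref  <- 100             # common reference time in seconds

scale <- t_ref / times
m     <- counts * scale          # counts normalized to 100 s
err   <- sqrt(counts) * scale    # Poisson errors, evaluated before scaling

p      <- 1 / err^2              # weights
mu     <- sum(p * m) / sum(p)    # weighted average
mu_err <- 1 / sqrt(sum(p))       # error of the weighted average

cat(sprintf("weighted average: %.1f +/- %.1f counts in 100 s\n", mu, mu_err))

# Cross-check with the merged samples (rate in counts/s)
rate     <- sum(counts) / sum(times)
rate_err <- sqrt(sum(counts)) / sum(times)
cat(sprintf("merged rate: %.2f +/- %.2f counts/s\n", rate, rate_err))
```

Using unrounded weights, as the code does, leaves the result unchanged at the quoted precision.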




Fig. 12.11 The folding effect: the physical quantity $x$ (physics) and the apparatus response $z$
(apparatus) combine into a function $h$, to give a measured value (observation) $y = h(z, x)$,
which is a function of these two variables


12.15 $M(f, \sigma, 0)$, $M(f, 0, \Delta)$ and $M(f, \sigma, \Delta)$ Measurements


The analysis of this class of experiments is very complicated, because the fluctuations
of the values assumed by the phenomenon are coupled with the fluctuations
and uncertainties due to the measurement apparatus. The goal of the analysis
is to determine the function $f(x)$, dependent on one or more variables, which
characterizes the physical statistical law describing the observed phenomenon.
However, what is directly observed is a density $g$, dependent on the intrinsic
fluctuations both of the observed quantity and of the measurement apparatus.² First
of all, it is therefore mandatory to experimentally determine the response of the
measurement device, called instrument function or apparatus function, which has
the following meaning:


Definition 12.9 (Apparatus Function) The apparatus or instrument function
$\delta(y, x) \, dx \, dy$ gives the probability that the value of the physical variable is within
$[x, x + dx]$ and that a value within $[y, y + dy]$ is measured.


Basically, the apparatus function is the probability that the input value is $x$ and the
apparatus gives an output quantity $y$.

The density of the observed data $g(y)$ therefore depends on two variables $(Z, X)$
(apparatus and physics), and the measured values of $Y$ are linked to these variables
by the relation (also schematized in Fig. 12.11):

$$
y = h(z, x) , \qquad z = h^{-1}(y, x) ,
$$

where the function $h$ represents the link between $z$ and $x$ due to the measurement
operations (a sum $x + z$, a product $x \cdot z$, or other functions).

² For brevity, we say we observe or measure a density $g$ as a shorthand for observing or measuring
a sample from $g$.

If now $p_Z(z)$ and $p_X(x) \equiv f(x)$ are the p.d.f. associated with the measurement
device and with physics, respectively, $Z$ and $X$ are usually independent, because it
is reasonable to suppose that the behaviour of the apparatus is independent of that
of the physical process. Then, Eq. (5.28) holds, and, for convenience, we rewrite it
with the new notation now given to variables:

$$
g(y) = \int p_Z\!\left( h^{-1}(y, x) \right) f(x) \, \frac{\partial h^{-1}}{\partial y} \, dx .
$$

The apparatus function can now be identified with the quantity:

$$
p_Z\!\left( h^{-1}(y, x) \right) \frac{\partial h^{-1}}{\partial y} \equiv \delta(y, x) ,
$$

which is just the density of the apparatus $p_Z(z)$ evaluated for $z = h^{-1}(y, x)$ and
therefore connected to both $x$ and $y$ through the measurement procedure. The derivative
$\partial h^{-1} / \partial y$ is the Jacobian of the transformation.

Summarizing, we can affirm that, based on the laws of compound and total
probabilities, at the basis of Eq. (5.28), the three functions $f(x)$ (physics), $g(y)$
(observation) and $\delta(y, x)$ (apparatus) are linked by the relation:

$$
g(y) = \int f(x) \, \delta(y, x) \, dx \equiv f * \delta , \tag{12.69}
$$

which can be interpreted as follows: the probability of observing a value $y$ is given
by the probability that the physical variable assumes a value $x$ times the probability
that the instrument, given an input value $x$, provides a value $y$. These probabilities
must be added (integrated) over all the values of $x$ in the spectrum that can
have $y$ as observed value.

Equation (12.69) is called the folding integral, and it is sometimes indicated with
an asterisk symbol. This integral is very important both in physics and engineering.
Since $g(y)$ is measured and $\delta(y, x)$ must be known, the experiment must determine
the function $f(x)$ which is, so to speak, “trapped” or “wrapped” in the folding
integral.

When the apparatus response is linear (as often happens):

$$
y = h(z, x) = z + x , \qquad z = h^{-1}(y, x) = y - x , \qquad \frac{\partial h^{-1}}{\partial y} = 1 ,
$$

the folding integral transforms into the convolution integral (5.27):

$$
g(y) = \int f(x) \, \delta(y - x) \, dx \equiv f * \delta , \tag{12.70}
$$

which is sketched in Fig. 12.12.
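The convolution integral can be evaluated numerically with the base R function `integrate`. As a minimal sketch, assume (purely for illustration) a uniform physical density on [0, 1] and a Gaussian apparatus function with standard deviation 0.1; for this particular choice the result is known analytically as a difference of two Gaussian cumulative functions, which provides a check of the numerical integral.

```r
f     <- function(x) dunif(x, min = 0, max = 1)    # physics (assumed)
delta <- function(d) dnorm(d, mean = 0, sd = 0.1)  # apparatus (assumed)

# g(y) = integral of f(x) * delta(y - x) dx over the support of f
g <- function(y) {
  sapply(y, function(yy)
    integrate(function(x) f(x) * delta(yy - x), lower = 0, upper = 1)$value)
}

y       <- c(-0.1, 0, 0.5, 1, 1.1)
g_num   <- g(y)
g_exact <- pnorm(y, 0, 0.1) - pnorm(y - 1, 0, 0.1)  # analytic result for this case

print(round(cbind(y, g_num, g_exact), 4))
```

The sharp edges of the uniform density at 0 and 1 are smeared by the apparatus: the observed density is about one half at the edges and is different from zero also outside the physical range.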
Fig. 12.12 Graphical representation of convolution (the density $f(x)$ passes through the
instrument $\delta(y - x)$ and reaches the observer as $g(y)$)

The techniques that extract $f(x)$ from the integrals (12.69) and (12.70) are called
unfolding and deconvolution techniques. In principle, the technique of Fourier or
Laplace transforms allows us to convert these integrals into easily solvable algebraic
equations. However, the error propagation through these algorithms often creates
considerable difficulties.

Fig. 12.13 Iterative method to solve a folding integral (an initial set of parameters defines
$f(x; \mu)$; the folding $g(y; \mu) = f * \delta$ is compared through the $\chi^2$ with the observed $g(y)$;
if the answer is NO, a new set of parameters is tried; if it is YES, $f(x; \mu) = f(x)$)

Usually the iterative technique schematized in Fig. 12.13
is used. A function $f(x; \mu)$ dependent on one or more parameters $\mu$ is “injected”
into the folding integral, which is solved with numerical or simulation methods; the
obtained solution $g(y; \mu)$, also dependent on the same parameter set, is compared
with the function $g(y)$ actually observed. If best-fit methods are used (as those used
in the Nlinfit routine), the $\mu$ parameters at the $\chi^2$ minimum are evaluated with the
negative gradient method, and the $\chi^2$ test at the end of the procedure allows the
determination of the agreement of the folding solutions with the observations. If
agreement is not satisfactory, the previous procedure is repeated with a different
function and a new set of initial parameters. In this way, we obtain the desired
function $f(x, \mu_0)$, with the optimal values of the parameters $\mu_0$ and their associated
error. This method, although rather computationally expensive, has the advantage of
being quite flexible and general.
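The iterative scheme can be sketched in R on simulated data. Everything in the example is an assumption made for illustration: the physical density is taken to be exponential with one parameter `tau`, the apparatus function is a Gaussian of known width, the folding integral is computed with a midpoint sum on a grid, and the base R minimizer `optimize` replaces the negative gradient method mentioned in the text.

```r
set.seed(42)
tau_true  <- 2.0    # parameter of the "physics" density (assumed exponential)
sigma_app <- 0.5    # Gaussian resolution of the apparatus (assumed known)

# Simulated observations: y = x + d
y <- rexp(20000, rate = 1 / tau_true) + rnorm(20000, sd = sigma_app)

# Histogram of the observed data on a fixed range
breaks <- seq(-1, 10, by = 0.25)
y      <- y[y > min(breaks) & y < max(breaks)]
h      <- hist(y, breaks = breaks, plot = FALSE)

# Folding integral g(y; tau), evaluated with the midpoint rule on a grid in x
dx <- 0.01
xg <- seq(dx / 2, 40, by = dx)
fold <- function(yv, tau) {
  fx <- dexp(xg, rate = 1 / tau)
  sapply(yv, function(yy) sum(fx * dnorm(yy - xg, sd = sigma_app)) * dx)
}

# Chi-square between observed and expected bin contents
chi2 <- function(tau) {
  g        <- fold(h$mids, tau)
  expected <- sum(h$counts) * g / sum(g)   # normalized to the observed total
  sum((h$counts - expected)^2 / expected)
}

fit <- optimize(chi2, interval = c(0.5, 5))   # best-fit parameter
nu  <- length(h$counts) - 2                   # fixed total and one fitted parameter
cat("tau =", fit$minimum, " chi2 =", fit$objective, " dof =", nu, "\n")
```

Evaluating the folded density only at the bin centres is a simplification; a more careful treatment would integrate it over each bin. If the $\chi^2$ at the minimum were not compatible with the number of degrees of freedom, the next step of the scheme would be to try a different functional form for the physical density.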

We do not intend to go into further detail here, since an abundant specialized
technical literature is available (see, e.g. [Blo84]). In our opinion, it is sufficient for
you to be aware of the problem and to know that there are ways to solve it. If you
pursue a career in experimental scientific research, then there is certainly a
folding or convolution integral waiting for you. When you stumble on it, do not be
discouraged, and search in the literature for the best method to solve that specific
problem: it almost certainly exists. If the error calculation or the solution stability
is important, remember to pay attention to the type of algorithm to use; this book
perhaps will help you make the best choice.

We have shown that the complete and reliable determination of the density $f(x)$
is generally complicated due to the presence of the apparatus function $\delta(y, x)$,
which takes into account the instrumental ($\Delta$) and random ($\sigma$) effects present in
the measurement process. In the case of convolution, i.e. when:

$$
\text{observation} = \text{physics} + \text{apparatus} \implies y = x + (y - x) \implies y = x + d ,
$$

if not the complete density structure, at least the mean $\langle X \rangle$ and the variance $\mathrm{Var}[X]$
(or the standard deviation) can be estimated from the sample quantities $m(x)$ and
$s^2(x)$ with the known rules (5.67), (6.65), (6.71).

First of all, we note that the average deviation $m(d)$ and the dispersion caused by
the measurement operations (the latter characterized by the variance $s^2(d)$
and/or by the systematic error $\Delta$) must be known with negligible error; otherwise the
experiment is not feasible. Moreover, $m(y)$ and $s(y)$ with their uncertainties are
also known from the observed data. Then, for the $M(f, \sigma, 0)$ measurements, we have:

$$
\begin{aligned}
s^2(x) &= s^2(y) - s^2(d) , \\
m(x) &= m(y) - m(d) , \\
\mu &\simeq m(x) \pm \frac{s(x)}{\sqrt{N}} , \\
\sigma &\simeq s(x) \pm \frac{s(x)}{\sqrt{2N}} .
\end{aligned}
\tag{12.71}
$$

In the case of the $M(f, 0, \Delta)$ measurement, we have instead:

$$
\begin{aligned}
s^2(x) &= s^2(y) - \frac{\Delta^2}{12} , \\
m(x) &= m(y) - m(d) , \\
\mu &\simeq m(x) \pm \frac{s(x)}{\sqrt{N}} , \\
\sigma &\simeq s(x) \pm \frac{s(x)}{\sqrt{2N}} .
\end{aligned}
\tag{12.72}
$$

Finally, for $M(f, \sigma, \Delta)$, we have:

$$
s^2(x) = s^2(y) - s^2(d) - \frac{\Delta^2}{12} , \qquad
m(x) = m(y) - m(d) ,
$$

while Eqs. (12.72) still hold for $\mu$ and $\sigma$.
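The moment-based recipe for the purely statistical case can be tried on simulated data. In the R sketch below the physical density (a gamma law with mean 2 and variance 1) and the apparatus response (Gaussian with mean 0.3 and standard deviation 0.5) are arbitrary assumptions, chosen only so that the true values are known in advance.

```r
set.seed(7)
N <- 5000
x <- rgamma(N, shape = 4, rate = 2)   # "physics": mean 2, variance 1 (assumed)
d <- rnorm(N, mean = 0.3, sd = 0.5)   # apparatus: m(d) = 0.3, s(d) = 0.5 (assumed known)
y <- x + d                            # what is actually observed

m_x  <- mean(y) - 0.3                 # m(x) = m(y) - m(d)
s2_x <- var(y) - 0.5^2                # s^2(x) = s^2(y) - s^2(d)
s_x  <- sqrt(s2_x)

cat(sprintf("mu    ~ %.3f +/- %.3f\n", m_x, s_x / sqrt(N)))
cat(sprintf("sigma ~ %.3f +/- %.3f\n", s_x, s_x / sqrt(2 * N)))
```

Only the first two moments are recovered in this way; the shape of the physical density remains hidden in the convolution.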


12.16 A Case Study: Millikan’s Experiments 


We think that it is very instructive to apply some of the concepts introduced so far
to the analysis of Millikan’s famous experiments on the electron charge. In recent
years, these experiments have attracted the attention of some historians and critics of
science, opening up some controversy on the way scientists operate [Fra84, Fra97].
Let us see what it is about.

Robert Millikan was an American physicist who became famous for his measurements
of the electron charge. For this research he received the Nobel Prize
in 1923 and numerous other awards, including 20 honorary degrees. In 1910, when
he was a professor at the University of Chicago, Millikan published the first results
of his experiments, obtained by studying the fall of oil droplets in an electric field.
The experimental set-up is shown in Fig. 12.14. Exploiting the Venturi effect, an
air flow A captures tiny droplets emitted from the oil ampoule O; these droplets
fall by gravity into a hole made in a capacitor C, where there is a static electric
field produced by the voltage generator P. The apparatus is kept depressurized
and controlled by a pressure gauge. The droplets become negatively electrified by
contact with the air flow, and their motion is therefore influenced by the presence of
the electric field of the capacitor, whose polarity slows down the motion of the drop,
since the lower plate has a negative potential.

It is known that the fall of a body in a viscous fluid (air) rapidly becomes a
uniform motion, with constant velocity, when the viscous friction force, which
according to Stokes’ law depends linearly on the speed, becomes equal to the
weight. This is the same motion as that of raindrops, or of a skydiver, both in free fall
and with the parachute open. Using therefore Stokes’ law and the equation of the
electric field of a capacitor, Millikan wrote the equations of motion of the drops with
and without a field as:

$$
6 \pi \eta v_0 r = \frac{4}{3} \pi r^3 \rho g \quad \text{(without electric field)} , \tag{12.73}
$$

$$
6 \pi \eta v r = \frac{4}{3} \pi r^3 \rho g - q \frac{V}{h} \quad \text{(with electric field)} , \tag{12.74}
$$

where $\eta$ is the air viscosity coefficient; $v$ and $v_0$ are the drop falling velocities with and
without the field, respectively; $r$ is the droplet radius; $\rho$ the oil density; $g$ the gravity
acceleration; $V$ the applied voltage; $h$ the distance between the capacitor plates; and
$q$ the droplet charge, which is the quantity to be measured.

Fig. 12.14 Millikan apparatus for the measurement of the electron charge. A = air flux, C =
capacitor, M = microscope, P = high-voltage generator, O = oil ampoule

By measuring the fall velocities $v$ and $v_0$ with and without field, from
Eqs. (12.73) and (12.74), it is possible to obtain the unknowns $r$ and $q$. The
oil drop velocity was evaluated by measuring the time of fall of each drop with
a microscope, looking through a lens with notches. The drop velocity could be
adjusted at will by changing the voltage $V$ or the air pressure. It was also possible
to keep the droplets suspended in the air ($v = 0$) by applying a suitable negative
potential to the lower plate of the capacitor.
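How the two unknowns follow from the two velocities can be sketched in R. The sketch assumes that the weight of the drop is $\frac{4}{3}\pi r^3 \rho g$ and that the electric force is $qV/h$; the function name and all numerical constants (SI units) are illustrative and are not Millikan's actual values. A round-trip check is used: the velocities are computed for a hypothetical drop and then inverted.

```r
# Illustrative constants in SI units (assumed, not Millikan's actual values)
eta <- 1.8e-5    # air viscosity (Pa s)
rho <- 900       # oil density (kg/m^3)
g   <- 9.81      # gravity acceleration (m/s^2)
V   <- 500       # applied voltage (V)
h   <- 0.016     # distance between the plates (m)

# Solve the two equations of motion for the radius r and the charge q
drop_from_velocities <- function(v0, v) {
  r <- sqrt(9 * eta * v0 / (2 * rho * g))      # from the field-free equation
  q <- 6 * pi * eta * r * (v0 - v) * h / V     # from the difference of the two equations
  c(r = r, q = q)
}

# Round-trip check on a hypothetical drop
r_in <- 1.0e-6                 # radius (m)
q_in <- 5 * 1.602e-19          # five elementary charges (C)
v0 <- (4 / 3) * pi * r_in^3 * rho * g / (6 * pi * eta * r_in)
v  <- v0 - q_in * V / (h * 6 * pi * eta * r_in)
res <- drop_from_velocities(v0, v)
print(res)
```

The radius is fixed by the field-free velocity alone, while the charge depends on the difference between the two velocities, which is why both had to be measured for every drop.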

In this way Millikan noticed that the charge $q$ was always an integer multiple
of a base value $q_0$. He wrote all of his observations in a log that is still available
today. The measurement error was assumed to be purely statistical, deriving from
the uncertainty in the manual determination of the falling times with the microscope
and a clock. Following our notation, we are therefore in the presence of an $M(0, \sigma, 0)$
measurement.

Millikan averaged a first group of 23 measurements, calculated the error of
the average as prescribed by Eq. (12.59) and published in 1913 the value of the
elementary electron charge:

$$
q_0 \equiv e = (4.778 \pm 0.002) \times 10^{-10} \ \text{esu} . \tag{12.75}
$$




A controversy concerning the stability of the measurement and the small error
assigned to the value of $e$ prompted Millikan to publish, in 1923, a new study of
the fall of 58 drops, in which he emphasized that the data did not belong “to a
selected group of falls, but represented all falls occurring in 60 consecutive days”.
The value of this second measurement was:

$$
e = (4.780 \pm 0.002) \times 10^{-10} \ \text{esu} . \tag{12.76}
$$


After a re-examination of Millikan’s notes, it appears however that in the first
measurement (that of the 23 drops), he excluded from publication 7 measurements judged
as “bad”, while in the second measurement (that of the 58 drops), he even excluded
82 values. Based on these findings, Millikan is often cited as an example of
scientific dishonesty. However, as he has been resting in peace for many years now,
it is not known whether the drops were discarded on good grounds (i.e. measurements not
properly carried out by him or his assistants) or only with the intention to minimize
the error. However, a re-analysis of Millikan’s data showed that the discarded drops
do not modify the published result in a statistically significant way [Fra84]. At
this point you might wonder what the true value of the electron charge
is. Obviously, we do not know the true value, but we do know the weighted average
of all the measurements performed to date, which turns out to be:

$$
e = (4.803\,2068 \pm 0.000\,0015) \times 10^{-10} \ \text{esu} . \tag{12.77}
$$

This value is known with a relative error of $3 \cdot 10^{-7}$ (0.3 parts per million) and can be
considered as the true reference value for the present discussion. The comparison
between this value and that of Millikan (12.75) can be made with the standard
variable:

$$
\frac{|\text{Millikan value} - \text{true value}|}{\text{Millikan error}} = \frac{|4.778 - 4.803|}{0.002} = 12.5 .
$$

The Millikan value is 12.5 error units away from the true one, in absolute violation
of the $3\sigma$ law. From a formal and methodological point of view, Millikan performed
a wrong measurement, because the true value is not covered by the interval (12.75)
according to the $3\sigma$ law.

However, it should be noted, in favour of the scientist, that Millikan demonstrated
for the first time that the electric charge is quantized and determined its value with
a relative error of:

$$
\frac{|4.778 - 4.803|}{4.803} \simeq 0.005 = 0.5\% ,
$$

which seems to us a very respectable result, also considering the period, the
originality of the measurement and the type of apparatus used.
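The two figures quoted in this discussion, the distance from the reference value in error units and the relative deviation, can be reproduced with the following R lines. The reference value is rounded to three decimals, as done in the text; the variable names are illustrative.

```r
# Values in units of 1e-10 esu, as quoted in the text
millikan     <- 4.778   # published central value
millikan_err <- 0.002   # published error
reference    <- 4.803   # present reference value, rounded as in the text

n_sigma <- abs(millikan - reference) / millikan_err   # standard variable
rel_dev <- abs(millikan - reference) / reference      # relative deviation

cat(sprintf("distance: %.1f error units, relative deviation: %.2f%%\n",
            n_sigma, 100 * rel_dev))
```

The contrast between the two numbers summarizes the case: the central value is off by about half a percent, whereas the quoted error is far too small to cover this difference.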

However, the anomalous difference (outside the statistical error) between Millikan’s
results (both original and corrected ones) and the true value remains to be clarified.




The reason lies in the physical model used for the measurement, represented by
Eqs. (12.73) and (12.74): an approximate coefficient $\eta$ of air viscosity was used,
and, in determining the final value, Millikan did not take into account the systematic
error resulting from this approximation.

We conclude this discussion with the words of another Nobel laureate in physics,
Richard Feynman [Fey18], who quoted the Millikan measurements as an example
of the so-called bandwagon effect, which we will discuss in the next section:


We have learned a lot from experience about how to handle some of the ways we fool 
ourselves. One example: Millikan measured the charge on an electron by an experiment 
with falling oil drops, and got an answer which we now know not to be quite right. It’s a 
little bit off, because he had the incorrect value for the viscosity of air. It’s interesting to 
look at the history of measurements of the charge of the electron, after Millikan. If you 
plot them as a function of time, you find that one is a little bigger than Millikan’s, and the 
next one’s a little bit bigger than that, and the next one’s a little bit bigger than that, until 
finally they settle down to a number which is higher. Why didn’t they discover that the 
new number was higher right away? It’s a thing that scientists are ashamed of this history 
because it’s apparent that people did things like this: when they got a number that was too 
high above Millikan’s, they thought something must be wrong and they would look for and 
find a reason why something might be wrong. When they got a number closer to Millikan’s 
value they didn’t look so hard. And so they eliminated the numbers that were too far off, 
and did other things like that. We’ve learned those tricks nowadays, and now we don’t have 
that kind of a disease. 


12.17 Some Remarks on the Scientific Method 


After the description of the technical contents of the measurement operations, we 
want to conclude the chapter (and the book) with some general observations on the 
scientific method. 

As we mentioned at the beginning, the scientific method is based on the 
observation of nature, through the measurement of physical observables with the 
procedures and techniques described in the previous sections. 

This approach is based on a principle without which all modern sciences could 
not exist: the postulate of objectivity. In other words, nature exists by itself, with 
its own laws, and is not a projection of the human mind. It is therefore not allowed 
to attribute to nature the aims and purposes of the subjective and cultural world of 
the experimenter. Objectivity must be considered as a postulate, since it cannot be 
demonstrated on the basis of more obvious or elementary assumptions. 

However, when this principle is not valid, as happens in some medical exper- 
iments, scientists are able to recognize this fact. We refer to the so-called placebo 
effect, which consists in the spontaneous healing of patients who are made to believe 
that they are being treated with an effective drug, while in reality they have taken 
a substance (such as water or sugar), called placebo, with no therapeutic effect. 
As experimentally verified, particular psychological conditions induced in confident 
patients can lead, for some pathologies, to a sizeable percentage of healings [Bro13]. 




So how do you check the effectiveness of a drug? The method is precisely that
of the contingency tables explained in Exercise 7.8: a placebo is administered to
sample A of patients, and the drug to sample B; the χ² test is then applied to
verify whether the drug gives effects statistically different from those of the placebo.
This methodology is known as a “double-blind experiment”, because neither the doctor
nor the patient knows whether the drug or the placebo is being administered. Homeopathy
and pranotherapy are among the most popular practices that have never passed the
double-blind test. It is therefore correct to state that these “therapies” heal; however,
it should always be remembered that in these cases therapeutic effects significantly
different from the placebo were never measured.
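
As a minimal sketch of this comparison, the following R lines apply the χ² test to a 2×2 contingency table. The counts are invented purely for illustration (they are not data from this book), and the names tab and res are our own choice.

```r
# Hypothetical counts (illustrative only): rows = treatment, columns = outcome
tab <- matrix(c(30, 70,    # sample A (placebo): healed, not healed
                45, 55),   # sample B (drug):    healed, not healed
              nrow = 2, byrow = TRUE,
              dimnames = list(c("placebo", "drug"),
                              c("healed", "not healed")))

# Pearson chi-square test of homogeneity, without Yates' continuity correction
res <- chisq.test(tab, correct = FALSE)
print(res$statistic)   # observed chi-square value
print(res$parameter)   # degrees of freedom
print(res$p.value)     # significance level of the observed difference
```

With these made-up counts the statistic is χ² = 4.8 with one degree of freedom, corresponding to a significance level of about 3%; with real data, the same call quantifies whether the drug can be distinguished from the placebo.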

Apart from these situations related to biomedical experimentation, scientific 
results are absolutely independent of any conscious or unconscious experimenter 
wish. Of course, it is not excluded that this situation may change one day or another, 
but up to now, this has never happened in any measurement correctly carried out in 
the fields of chemistry, physics and biology. 

In non-technical language, understandable to any thoughtful person, we could
say that the scientific method can be summarized in two equivalent and completely 
obvious principles, even if unfortunately little used: theories collapse in front of 
facts (and not vice versa), or similarly reality must be analysed as it is and not as 
one would like it to be. 

Technically, this methodology corresponds to an attitude that in science, 
following Popper [Pop59], is known as falsification procedure: if you have a theory, 
you must try to prove that it is false; if even a single experimental fact disagrees 
with it, this theory must be revised. On the other hand, if all the experimental results 
agree with it, the theory must be (temporarily) accepted. Therefore, there are no true 
theories, but only incorrect theories (or valid only for a limited range of phenomena), 
as they have been falsified by one or more experiments, and valid theories that are 
not yet falsified. The latter, like the theory of relativity and quantum mechanics, 
are part of the current scientific heritage. We invite you to re-read the epigraphs of 
Fermi and Einstein at the beginning of Sect. 7. 

At this point it should be quite clear that the statistical comparison between an 
experimental result and a model, through the calculation of the significance level 
and the evaluation of the probability of making a mistake by rejecting (falsifying) a 
true hypothesis (type I error), is a particular but fundamental technical aspect of the 
general falsification procedure of modern science. As an example, you can review 
Exercises 3.17, 7.2, and 12.5. 
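
The meaning of the type I error can be illustrated with a short simulation. The sketch below is our own toy example (the sample size, the number of simulated experiments and the use of Student's t test are arbitrary assumptions): many experiments are generated under a hypothesis that is true, and the fraction of wrong rejections is counted.

```r
set.seed(1)        # arbitrary seed, for reproducibility only
n_exp  <- 10000    # number of simulated experiments
n_data <- 20       # measurements per experiment
alpha  <- 0.05     # chosen significance level

# Each experiment: n_data Gaussian measurements whose true mean is 0,
# followed by a Student's t test of the (true) hypothesis "mean = 0"
pval <- replicate(n_exp, t.test(rnorm(n_data), mu = 0)$p.value)

# Fraction of experiments in which the true hypothesis is wrongly rejected
type1 <- mean(pval < alpha)
print(type1)
```

The fraction of wrongly falsified true hypotheses fluctuates around the chosen level alpha, which is exactly the risk accepted when the significance level is fixed.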

Another important aspect is that the theories and models, to have scientific 
validity, must be falsifiable, that is, it must be possible in principle to establish that 
they are false. If I state that when the patient heals the treatment works, while if the 
patient does not heal it is because he/she is not tuned with cosmic energy, it is clear 
that the treatment I propose is not falsifiable, since it cannot be disproved by any 
experiment. 

The distinctive feature of the falsification process, and of all persons having 
a scientific mentality and culture, is to focus attention on rejections and failures 
rather than on the successes of a theory or hypothesis. This point is fundamental
and by no means obvious. Suppose you make a horoscope and predict that persons
with a particular zodiac sign will get a flat tire during a certain week. Among 
(let’s say) 5000 people reading this prediction, it will happen by chance (say) to 
10 people. These ten people, if not scientific-minded, will be impressed by your 
prediction, will become your followers and will spread their positive experience to 
others. By continuing to make horoscopes, your predictions will always be crowned 
with success (in a statistical sense); in this way you will gradually accumulate 
a considerable number of fans and could perhaps become a famous (and rich) 
astrologer. 
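
The statistical "success" of the horoscope can be sketched with a few R lines. The number of readers and the probability of a flat tire are those implied by the text; the number of weekly predictions (one year) is our own assumption.

```r
set.seed(2)             # arbitrary seed, for reproducibility only
readers <- 5000         # people reading the prediction (from the text)
p_flat  <- 10 / 5000    # chance of a flat tire in a given week (from the text)
weeks   <- 52           # hypothetical: one prediction per week for a year

# Number of readers for whom the prediction "comes true", week by week
hits <- rbinom(weeks, size = readers, prob = p_flat)
print(mean(hits))        # close to the expected value readers * p_flat = 10
print(mean(hits == 0))   # fraction of weeks without any "confirmation"
```

Every week a handful of readers sees the prediction confirmed, while the several thousand failures go unnoticed: this is why attention must be focused on failures rather than on successes.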

The falsification process is not to be understood in a rigid and schematic sense: 
if an experiment falsifies a theory that has withstood hundreds or thousands of 
previous experiments, it is reasonably more likely that that single experiment is 
wrong and the theory valid, rather than the other way around. Frequently, amateur 
scientists claim sensational discoveries, such as that of perpetual motion or the 
non-conservation of the electric charge; if these experiments were true, the whole 
edifice of modern physics would collapse. Great care is therefore needed
in judging and evaluating new sensational scientific results.

Science can go ahead according to the scheme described so far only if scientists
are in good faith, check one another's work, and there are mechanisms for verifying
scientific information. The rules adopted by the community of professional scientists
require that scientific results, to be considered as such, must be published in
“reliable” journals, which are edited by some of the world’s leading experts on the
subject (all those working in a specific field know who these experts are).

Theories can thus be checked and experiments repeated: then a transitory phase 
is triggered at the international level, which we could define as validation, at the 
end of which the theoretical or experimental scientific results can be considered 
acquired. Statistical data analysis and the comparison among different experiments 
almost always play a leading role at this point. Scientists therefore operate in a 
sort of free zone, self-managing scientific information in absolute autonomy. This, 
in our opinion, is one of the most significant positive features of our times. In 
order to safeguard the autonomy that science has been able to acquire, every
researcher must be aware of the responsibility involved in publishing a
scientific result in a journal or communicating it at a congress, and must operate
with care and correctness.

After the description of the system physiology, we now want to discuss some possible
pathologies. Concerning the execution of the experiments, they are essentially
of three types:


• Use fraud.
• Treat data incorrectly (data cooking, trimming ...).
• Fall into the expectation bias (also called the bandwagon effect).


We will now briefly review these three pathologies. 

The use of fraud in science, that is, the existence of unscrupulous scientists, 
fortunately proved to be a secondary aspect. The university selection mechanisms 
and the cross-checking between scientists, through conferences and scientific
journals, have so far made it possible to promptly reveal frauds and tricks. They have
mainly occurred in the biomedical sector, where there is a more frequent interplay 
between science and economic interests. The case of W.T. Summerlin, who in the 
mid-1970s painted black patches on the fur of white mice to simulate grafts, is often 
cited as an extreme example [Hix76]. 

The case of E. Rupp is instead famous in physics. As he later admitted 
himself, he literally invented, at the beginning of the 1930s, some results on the 
polarization of double electron scattering. It should be noted that the results of Rupp 
undoubtedly falsified Dirac’s relativistic theory of the electron, which already had 
several experimental confirmations. However, the physicist community, particularly 
active and lively, quickly discovered the deception, forcing Rupp to retract his 
results [Fra84, vD07]. 

The second pathology, the incorrect data treatment, consists in cleaning operations
(trimming) and manipulation (cooking) of the results and is much more subtle
than the previous one, as it is much more difficult to discover. It is not due so
much to the bad faith of the experimenter, as to the ignorance of data analysis and 
processing techniques. The most frequent mistakes concern the failure to correct 
systematic effects and the incorrect application of statistics and of techniques that 
calculate and propagate measurement errors. Millikan’s experiments, discussed in 
the previous section, fall into this category. In fact, we do not know if Millikan 
dishonestly discarded some measures; however, it is certain that he did not take into 
account important systematic effects. 

Another aspect of this pathology is the overestimation of measurement errors, 
to confirm previous measures or theories that are considered as reliable. In this 
way, possible significant discrepancies between theory and experiment are kept 
hidden, totally nullifying the principle of falsification. This too demonstrates, once 
again, how important it is for an experimenter to be able to correctly evaluate the 
measurement uncertainties. 

Multiple repetitions of experiments and cross-checks among scientists are the 
weapons used successfully by the scientific community to identify incorrectly 
performed experiments or technically wrong theories. But sometimes they are not 
enough. As an example, let us examine the modern evaluations of one of the 
most important physical quantities, the Newtonian constant of gravitation G, taken 
from [RS17] and shown in Fig. 12.15. It is evident that the “Big G” measurements are
incompatible. Are the discrepancies due to miscalculated errors or to a dependence of
G on the experimental materials? This is still an open point, linked to the possible
presence of an unknown gravitational component of very weak intensity (the so-called
“fifth force”), which would deprive G of the characteristic of universality.

Fig. 12.15 Some recent measurements of the universal gravitation constant G and their
measurement uncertainties. The numbers on the right denote the years when the results were
published. The vertical black line and the grey band give the recommended value (2014) and
its 1σ uncertainty, respectively (from [RS17]). Horizontal axis: G/(10⁻¹¹ m³ kg⁻¹ s⁻²),
from 6.671 to 6.676

And now we come to the most subtle and dangerous effect of all, the expectation
bias, often referred to as the bandwagon effect [Jen06], which has already been
clearly described in the quote by Feynman reported at the end of the previous
section. As you have surely understood, this is the psychological influence on the
experimenter of the pre-existing scientific situation. In fact, researchers obtaining
results in strong disagreement with accepted theories or previous measurements
usually scramble to search for possible experimental biases until they get
a result close to expectations. At this point they are satisfied and become less
careful. This, however, can generate situations where the experimental results are
all in agreement with each other and all wrong. A good example is Fig. 12.16,
taken from [LW65], which shows the successive measurements of Michel’s ρ parameter,
characterizing the energy distribution of the electrons emitted in some nuclear
decays. Each experiment is in agreement with the previous one, but the first ones
are incompatible with the value of 0.75, which is the expected theoretical value.
Moreover, the measurements approach the expected theoretical value in a
monotonically increasing way. Very likely, these results are affected by the bandwagon
effect.

Fig. 12.16 Michel’s ρ parameter measurements as a function of the year in which they were
performed (horizontal axis: year, from 1948 to 1964; vertical axis: ρ). Within errors, any
measurement is in agreement with the previous one (adapted from [LW65])

Fig. 12.17 Measurements of the parameter η+− ordered by publication date, in two groups
(before 1973 and after 1973); vertical axis: η+− × 10³ (from [Jen06])
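
The mechanism of the expectation bias can be mimicked with a toy simulation. The sketch below is only our own illustrative model, not a description of any real experiment: every experimenter measures without bias, but redoes the measurement whenever the result lies more than one standard deviation away from the last published value. All numerical values are arbitrary assumptions.

```r
set.seed(3)             # arbitrary seed, for reproducibility only
truth <- 1.00           # true value of the measured quantity (assumed)
sigma <- 0.05           # statistical error of every experiment (assumed)
n_exp <- 12             # number of successive published results

published    <- numeric(n_exp)
published[1] <- 0.80    # first published value, wrong by 4 sigma

for (i in 2:n_exp) {
  repeat {
    x <- rnorm(1, mean = truth, sd = sigma)   # unbiased measurement
    # Toy rule: a result farther than 1 sigma from the last published
    # value is "checked for errors" and redone, otherwise it is published
    if (abs(x - published[i - 1]) < sigma) break
  }
  published[i] <- x
}
print(round(published, 3))
```

Each published value is compatible with the previous one, yet the sequence only slowly drifts from the wrong initial value towards the true one, qualitatively as in the figures discussed above.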

Another example, reported in [Fra84], is the measurement of the parameter η+−,
which is the ratio between the decays of some unstable nuclear particles, the K
mesons. This parameter has a crucial importance in elementary particle physics,
and, despite the experimental difficulties, it has been measured several times over
the years. The obtained results are shown in Fig. 12.17: their averages before and
after 1973 are incompatible. Here too a bandwagon effect within the two groups of
measurements seems evident. The currently accepted value (2020, see [ZeaPDG20]) is
the weighted average of the second group of measurements:


η+− = (2.232 ± 0.011) × 10⁻³.


Unlike the previous effects, the expectation bias does not imply bad faith or ignorance
on the part of the experimenter. On the contrary, it is a psychological effect which can
also affect well-trained, very experienced and excellent researchers. As pointed out in
Feynman’s quote, probably the best remedy against this effect is to be aware that it
exists.

We think we have demonstrated that scientists, contrary to popular belief, are 
not infallible, nor do they consider themselves to be so. However, although we are 
aware that we are about to make a very demanding statement, we would like to end 
this book by saying that the scientific method, with its technical and fundamental
principles, is flawless in the long run and is able to be immune from possible
mistakes of individual scientists. 


12.18 Problems 


12.1 If the mass m and the acceleration a are independently measured with a 
relative statistical error of 4% and 5%, respectively, which is the relative statistical 
error on the force F = ma? 


12.2 A radioactive source having a mean life τ = 5 days, and with an initial activity
of I₀ = 1000 decays/s known with negligible uncertainty, has now a residual activity
of I = 10 decays/s, with an uncertainty of s_I = ±1 decay/s. Given the decay law
I/I₀ = exp(−t/τ), determine the elapsed time with the corresponding error.


12.3 A radiation monitor records J = 157 counts in 1 s in the presence of the 
source and F = 620 background counts in 10 s without the source. Determine the 
number of source counts, with its relative error, after background subtraction and 
the signal/background ratio. 


12.4 The expected background of a counting experiment (accurately measured in 
the calibration phase) is ten events/s. In one test (background plus signal), 25 counts 
are recorded in a second. Using the Gaussian approximation, find the upper limit of 
the signal counts only with CL = 95%. 


12.5 A scientific result is reported as 5.05 ± 0.04, specifying that it is the average
of four measurements and that a CL = 95% is associated with this interval. Find 
the accuracy of the single measurement. 


12.6 The volume of a cylinder V = (π R²)L is obtained by measuring the base
radius R and the height L. To improve the accuracy of the indirect measurement of 
V, is it more important to reduce the percentage error on R or on L? 


12.7 An angle has been measured as θ = (30 ± 2)°. Calculate sin θ together with
its error.


12.8 A polarized particle beam is scattered onto a target, and the polarization
percentage is measured as P = (N₊ − N₋)/(N₊ + N₋), where N₊ is the number
of particles deviated “up”, N₋ is the number of those deviated “down” and N =
N₊ + N₋ is the total number of scattered particles. It is well known to physicists
that the measured polarization should be written as P ± s(P) = P ± √((1 − P²)/N).
Check the error formula, both for fixed and variable N. Use the results of Sect. 6.14.




12.9 A voltage is measured with a class 1 instrument as V = 10.00 ± 0.05 V, where
the error is the sensitivity interval. Determine the value of the function f(V) =
exp(0.1 V) with its error. Discuss the confidence levels.


12.10 The output voltage E₁ of a divider with two resistors R₁ and R₂ and input
voltage E is given by E₁ = E R₁/(R₁ + R₂). If R₁ = R₂ = 1000 ± 50 Ω and
E is measured, with 1% accuracy, as E = 10.00 ± 0.05 V, calculate E₁. Check
the confidence levels with a simulation. The measurement at 1% of the output voltage is
E₁ = 4.91 ± 0.02 V. Verify whether this value is in agreement with the predictions. Note:
all uncertainties are maximum sensitivity errors (CL = 100%).


12.11 The constant speed of a slide moving horizontally on a cushion of air is
measured as v = s/t. The space covered is s = 2 m, with an uncertainty estimated
at ±1 mm due to the sensitivity of the measuring tape used. The time t, measured
by repeating the test 20 times, shows random variations, and its histogram has
parameters m_t = 5.35 and s_t = 0.05 s. Determine the slide speed and comment
on the obtained result.


12.12 Write the matrix V of Eq. (12.24) for a common systematic error propor- 
tional to the measured value x; Ge =€-Xj). 


12.13 Verify Eq. (12.7) using simulation. 


12.14 Consider a measurement of the variable Z = XY where X, Y ~ U(0, 1).
Find the coverage probabilities of the intervals μ ± Kσ with K = 1, 2, 3.


12.15 A rectangular flat counter records 1750 events in 1 s. The counter dimensions
are a = 30 ± 0.5 cm and b = 50 ± 0.5 cm, where the error is systematic. Find the
counting frequency and its error in 1/(m² s) units.


12.16 Use the Nlinfit routine to estimate the value μ of the Poissonian density
of Fig. 12.10.


12.17 The energy of photons from an atomic transition is transformed into electrical
pulses whose peak voltage is measured with a multi-channel analyser. On the
display monitor, the spectral line shape is a Gaussian with σ ≈ 20 channels. The
calibration of the analyser with a fixed voltage value delivered by a pulse generator
gives a Gaussian curve of σ₀ ≈ 5 channels. This value represents the dispersion
introduced by the measurement. Determine the actual line width σ_r.




12.18 Perform the linear best fit to the data of Eq. (12.27) with the covariance 
method (see Eq. (12.21)) assuming as oss a multiplicative error of 20%. Compare 
the result with that of Table 12.1. 


12.19 Which of the two conjectures, (a) “some elephants fly” and (b) “all elephants 
can fly”, has scientific validity? 


Appendix A 
Table of Symbols 


Meaning

Real numbers
Real n-ple
Random variables
Random variate: observed value (occurrence or outcome) of a random variable in an experiment
n-dimensional random sample of X: X = (X1, X2, ..., Xn)
Vector of m random variables: X = (X1, X2, ..., Xm)
n-dimensional random sample of X: X = (X1, X2, ..., Xn)
Occurrence of a random sample
Mean or expected value operator of X
Value of the operator ⟨X⟩
Variance operator of X
Value of the operator Var[X]
Standard deviation of X: √Var[X] = σ[X]
Value of the operator σ[X]
Covariance operator of X and Y
Value of the operator Cov[X, Y]
Sample mean
Observed value of M
Sample variance
Value of S² after an experiment
Sample standard deviation
Value of the standard deviation after an experiment or statistical error (also called root mean square)


© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
A. Rotondi et al., Probability, Statistics and Simulation, La Matematica per il 3+2
139, https://doi.org/10.1007/978-3-031-09429-3


Symbol            Meaning

s_xy, s(x, y)     Value of Cov[X, Y] after an experiment
ρ(x, y)           True value of the correlation coefficient
r(x, y)           Sample value of the correlation coefficient
≃                 Approximately equal
∼                 Distributed as
#(A)              Number of elements satisfying the condition A
⟹                 Implies
p_X(x)            Probability density function of X
p.d.f.            Probability density function
F_X(x)            Cumulative (distribution) function of X
f(x)              A function of x
f(·)              Functional form of a variable
N(μ, σ²)          Normal or Gaussian distribution of parameters μ and σ
g(x; μ, σ)        Probability density function of X ∼ N(μ, σ²)
Φ(x)              Cumulative function of X ∼ N(0, 1)
t_α               Quantile of the Student’s or Gaussian distributions
U(a, b)           Uniform distribution in [a, b]
χ²(ν)             Chi-square distribution with ν degrees of freedom
Q(ν)              Variable distributed as χ²(ν)
χ²                Values assumed by Q(ν) after a trial
χ²_α              Quantile of the χ² density
χ²_R(ν)           Reduced chi-square distribution with ν degrees of freedom
Q_R(ν)            Variable distributed as χ²_R(ν); Q_R(ν) = Q(ν)/ν
χ²_R              Values assumed by Q_R(ν) after a trial
χ²_Rα             Quantile of the χ²_R density
L                 Likelihood function
θ̂                 Maximum likelihood (ML) or least squares (LS) point estimate of a parameter θ


Appendix B 
R Software 


The R software was created in 1994 in the Statistics Department of the University 
of Auckland, New Zealand, as an evolution of the S language, a statistical software 
developed in the Bell laboratories by a research group led by John Chambers. 
Since 1997, it has become an international standard and is distributed through the
Comprehensive R Archive Network (CRAN). To date, it is the most popular and
most used statistical software; it is free and open source, i.e. public.

The codes follow standards similar to structured C++ programming, and the 
routines are extremely easy to use. R now contains thousands of routines and 
allows you to perform the same calculation in many different ways. We assume that 
you have installed R on your computer and found one of the many good manuals 
available on the web, so that you are able to use the R routines and to create simple 
R scripts. 

In this Appendix we collect some methods that we have frequently used 
throughout the book. They are certainly not the only ones and are perhaps not the 
best to be used. Nevertheless, they well satisfy one of the purposes of the book, that 
is, to verify and describe, using simulation and data science techniques, the basic 
concepts gradually introduced in this text. 

The best thing one can do to solve a problem with R is to query a web search
engine: usually the answer is easily found, because that problem has already been
addressed and solved with R by someone else. For example, to sort a vector v, the
function sort(v) can be used; if, on the other hand, one is only interested in
having a vector of indices in ascending order, a simple web query will lead to the
order(v) routine, which gives the sequence of the indices corresponding to the
ordered elements of the v vector. By typing ?order in the R console, one will
have access to the R html manual which will explain the details of the routine. The
prevailing way of learning R is inductive and bottom-up, as is often the case in
computer science.

Let us now briefly summarize some ways of using R presented in the book. 




R’s prompt is >, and comments are identified by the character #. The base object
of R is the vector, whose size is set by the right-hand side of the assignment, which
often uses the c() function. For example, with the command:
> x <- c(13,5,27,5,47)
a numeric vector x of five entries is defined. The <- symbol is normally used for
assignment and is equivalent to =. With the command ?c(), the numerous possibilities
of this function are listed. The vector x thus created can be mathematically
manipulated without difficulty.

For example, the command y <- x*x or y <- x^2 creates a vector y
containing the squares of the entries of x, whereas y <- sqrt(x) creates a vector
of square roots and so on. The quoting symbols "..." and '...' are equivalent. There
are many useful functions that manipulate vectors. Some of them are reported in
Table B.1.
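
A minimal sketch of these vector operations follows; the numerical values are our own illustrative choice and the variable names y1, y2 and r are arbitrary.

```r
x  <- c(4, 9, 16, 25, 1)   # illustrative values, not taken from the text
y1 <- x * x                # element-wise product
y2 <- x^2                  # same result with the power operator
r  <- sqrt(x)              # element-wise square root

print(identical(y1, y2))   # the two ways of squaring agree
print(sort(x))             # entries in ascending order
print(order(x))            # indices that put x in ascending order
print(x[order(x)])         # same vector as sort(x)
print(range(x))            # c(min(x), max(x))
print(which.max(x))        # index of the maximum
```

Note that all the operations act entry by entry, without any explicit loop.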

Matrices can be created in different ways; the simplest one is perhaps to use the
function matrix(). For example, the command:
> x <- matrix(c(1,2,3,4,5,6),nrow=2,byrow=TRUE)


generates the matrix:

$$ \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix} . \tag{B.1} $$


The command ?matrix() lists all the methods available to generate a matrix. In
the text we have sometimes used the functions cbind() and rbind(), which
join matrices by columns and rows, respectively. As an example, the matrix (B.1)
can be generated with the command:

> x <- rbind(c(1,2,3),c(4,5,6))

In statistics, it is often necessary to sum across the rows or columns of a matrix
or table. In this case it is better to use the function apply(), as in our routine
CorrelEstuH. For example, the command sr <- apply(x,1,sum) creates
a vector sr = c(6,15) with the sum of the entries of each row of the matrix (B.1)
(that is, the sum is taken across the columns), whereas the command
sc <- apply(x,2,sum) creates a vector sc = c(5,7,9) with the sum of the
entries of each column (that is, the sum is taken across the rows).


Table B.1 Some R functions acting on a vector x

Function          Result
range(x)          Two-entry vector c(min(x),max(x))
which.min(x)      Integer, index of the minimum
which.max(x)      Integer, index of the maximum
sort(x)           Vector sorted in ascending order
sort(x,dec=T)     Vector sorted in descending order
order(x)          Vector of indices of x giving the increasing order
length(x)         Integer, length of x
sum(x)            Sum of the entries of x
prod(x)           Product of the entries of x
mean(x)           Mean of the entries of x
var(x)            Variance of the entries of x


Table B.2 R functions for statistical distributions

Prefixes          Function
d                 Density value
p                 Cumulative function value
q                 Quantile value
r                 Random variate

Suffixes          Distribution
unif              Uniform
binom             Binomial
norm              Gaussian
pois(lambda)      Poissonian with mean lambda
chisq             χ²
gamma(shape=1)    Negative exponential
gamma(shape=n)    Erlang distribution of order n
rayleigh(scale)   Rayleigh distribution with σ = scale
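
The two calls can be checked directly on the matrix (B.1). In the following sketch the comparison with rowSums() and colSums(), two base R shortcuts for the same sums, is our own addition.

```r
x  <- rbind(c(1, 2, 3), c(4, 5, 6))   # the matrix (B.1)

sr <- apply(x, 1, sum)   # second argument 1: one sum for each row
sc <- apply(x, 2, sum)   # second argument 2: one sum for each column
print(sr)
print(sc)

# Base R shortcuts giving the same results
print(rowSums(x))
print(colSums(x))
```

The second argument of apply() selects the index that is kept (1 for rows, 2 for columns), while the function given as third argument is applied over the other index.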

The R software has a lot of graphics options; in the book we have frequently used
the routine hist(x), which creates the histogram from the raw data contained in x,
and the function plot(). These routines have many options that can be listed by
typing ?plot or ?hist. The simplest command to have a line plot of a function
with abscissae and ordinates respectively contained in the vectors x and y is
plot(x,y,type="l"), where "l" is the lowercase letter l (for "line"). The R software
obviously contains functions handling all the most frequently used p.d.f. The name
semantics is explained in Table B.2: the prefixes d, p, q, r are positioned at the
beginning of the names, followed by the suffix which indicates the distribution type.
For example, if x is a vector containing the support values, dnorm(x) gives a vector
containing the corresponding ordinates of the standard Gaussian;
x <- rpois(100,lambda=2) generates a vector x of 100 Poissonian random variates
with mean μ = 2; qnorm(0.95) gives as output 1.6448, which is the Gaussian quantile
corresponding to a 95% cumulative probability. Most of these routines have default
values that can be changed by following the instruction manual.

For the standard Gaussian, ..norm() has the default values μ = 0, σ = 1,
which, for example, can be modified by writing ..norm(..,mean=3,sd=2).
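
The prefix and suffix semantics can be tried out with the following lines, a minimal sketch in which the variable names and the seed are our own choice.

```r
set.seed(4)                            # arbitrary seed, for reproducibility only

d0 <- dnorm(0)                         # density of N(0,1) at 0: 1/sqrt(2*pi)
p  <- pnorm(1.6448)                    # cumulative probability, close to 0.95
q  <- qnorm(0.95)                      # quantile, about 1.6448
q2 <- qnorm(0.95, mean = 3, sd = 2)    # same quantile for N(3, 2^2): 3 + 2*q
k  <- rpois(100, lambda = 2)           # 100 Poissonian variates with mean 2

print(c(d0, p, q, q2))
print(mean(k))                         # sample mean, close to lambda
```

The same four prefixes work with every suffix of Table B.2 that is available in the installed packages, for example punif(), qchisq() or rbinom().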

While studying distributions, we often used the function density(x), a
powerful routine which produces density functions starting from a raw data vector x,
by applying smoothing techniques.¹ Without going into the complex mathematical
details of the routine, which can be found in the R manual, it is sufficient for the user
to know that smoothing can be controlled with the parameter adj (an accepted
abbreviation of the full argument name adjust), which is 1 by
default. A value adj = 0.01 practically reproduces the original data distribution,
which is gradually smoothed by increasing the parameter value. The following
instructions visualize the uniform distribution, essentially a smooth histogram
connected by continuous lines, of 200 data randomly generated from the U(0, 1)
p.d.f.:


x <- runif(200)
plot(density(x,adj=0.01),type='l')
grid()


With the command grid(), a grid is superimposed on the plot. It is instructive
to vary the adj parameter to verify how it affects the smoothing.
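
As a sketch of this exercise, the lines below compare a weakly and a strongly smoothed estimate of the same data. The two values of the smoothing parameter, the seed and the object names are arbitrary choices of ours; the full argument name adjust is used instead of the abbreviation adj.

```r
set.seed(6)                             # arbitrary seed, for reproducibility only
x <- runif(200)                         # 200 variates from U(0,1)

d_rough  <- density(x, adjust = 0.05)   # little smoothing: follows the raw data
d_smooth <- density(x, adjust = 2)      # strong smoothing

plot(d_rough, main = "Effect of the smoothing parameter")
lines(d_smooth, lwd = 2)                # superimpose the smoother curve
grid()

print(c(d_rough$bw, d_smooth$bw))       # kernel widths actually used
```

The component bw of the returned list is the kernel width, which scales proportionally to the smoothing parameter: the larger it is, the flatter and more regular the estimated curve becomes.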

In the case of joint two-dimensional distributions, if the raw data are contained
in a two-column array x of pairs, the routine bkde2D can be used as follows:


# 1000 gaussian pairs with
# meanx=5, meany=10, varx=vary=3 and varxy=-2
require(mvtnorm)
x <- rmvnorm(1000,c(5,10),sigma=rbind(c(3,-2),c(-2,3)))
# 2D kernel density estimate; bkde2D has no adj argument:
# the smoothing is set by bandwidth (one value per coordinate)
require(KernSmooth)
hsm <- bkde2D(x,bandwidth=c(0.5,0.5))
# plot with persp(x1,x2,fhat)
persp(hsm$x1,hsm$x2,hsm$fhat)


where bandwidth is the smoothing parameter (the kernel width along each coordinate;
the value 0.5 is only an illustrative choice) and persp(x,y,z) draws the
perspective plot of the two-dimensional surface.

Lists of objects are often used in R: for example, an object alis can be defined
as alis <- list(a=1,b=2,c=3), and then the list members can be recalled
using the symbol $; with the command alis$c, the answer is 3. Often the routines
return a list as output: it is therefore necessary to know the names of its
components (usually declared as Value in the user manual) and to access them
with $. For example, if we make the histogram of a data vector x, the command
hist(x)$counts will return a vector with the number of events in each bin.
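
A short sketch of the use of lists follows. The data vector and the seed are arbitrary choices of ours, and the option plot = FALSE is used only to obtain the output list of hist() without drawing the histogram.

```r
alis <- list(a = 1, b = 2, c = 3)
print(alis$c)                 # member c of the list

set.seed(5)                   # arbitrary seed, for reproducibility only
x <- rnorm(500)               # illustrative data vector
h <- hist(x, plot = FALSE)    # histogram computed but not drawn

print(names(h))               # names of the members of the output list
print(h$counts)               # number of events in each bin
print(sum(h$counts))          # total number of events, equal to length(x)
```

The command names() applied to the output of any routine is a quick way to discover which members can be accessed with $.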


¹ The smoothing procedure replaces the original point of a function with the mean of the
neighboring points. The function, therefore, appears “smoother” as the number of points used for
the average increases.


Appendix C 
Moment-Generating Functions 


In this appendix we briefly mention one of the most important methods for the study
of distributions, the method of generating functions, which allows one to solve many
problems in a concise and elegant way. The method associates with the random
variable X a new variable, defined by the function $e^{tX}$, where t is a real variable that has
no particular statistical meaning. The average value $\langle e^{tX} \rangle$ is a function $G_X(t)$ which
takes the name of moment-generating function (Mgf):


$$ G_X(t) = \langle e^{tX} \rangle = \begin{cases} \sum_i p_i \, e^{t x_i} & \text{discrete variables} \\ \int e^{tx} \, p(x) \, dx & \text{continuous variables} \end{cases} \tag{C.1} $$


For $t = 0$, the sum and the integral are obviously always convergent. For $t \neq 0$,
Eq. (C.1) may not converge. Here we will deal with some cases, related to the most
important distributions, in which there is convergence for values $|t| < M$, where
M is an arbitrary positive number. For continuous variables the generating function is,
apart from a sign in the exponent appearing in the second of Eq. (C.1), the
Laplace transform of the density of X. Expanding the exponential in $G_X$ into a power
series and exploiting the linearity properties of the mean, one obtains:


$$ G_X(t) = 1 + t \langle X \rangle + \frac{t^2}{2!} \langle X^2 \rangle + \frac{t^3}{3!} \langle X^3 \rangle + \dots \tag{C.2} $$

From this result the important relation:

$$ \langle X^n \rangle = \left. \frac{d^n G_X}{dt^n} \right|_{t=0} = G_X^{(n)}(0), \tag{C.3} $$


follows, which connects the n-th derivative of $G_X$ at the origin to the n-th order
moment of X.
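
Relation (C.3) can be verified numerically. The sketch below uses the Poissonian Mgf, exp[μ(e^t − 1)], listed in Table C.1, and approximates the first two derivatives at the origin with central finite differences; the value of μ and the step h are arbitrary choices of ours.

```r
mu <- 2                                     # Poissonian mean (illustrative value)
G  <- function(t) exp(mu * (exp(t) - 1))    # Poissonian Mgf
h  <- 1e-4                                  # step of the finite differences

m1 <- (G(h) - G(-h)) / (2 * h)              # approximates G'(0)  = <X>
m2 <- (G(h) - 2 * G(0) + G(-h)) / h^2       # approximates G''(0) = <X^2>

print(c(m1, mu))                  # first moment of the Poissonian: mu
print(c(m2, mu + mu^2))           # second moment: mu + mu^2
print(m2 - m1^2)                  # variance, close to mu
```

The numerical derivatives reproduce the known Poissonian moments, and their combination gives back the variance μ.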




Table C.1 Moment-generating function (Mgf) of some probability distributions

p.d.f.                      Generating function $G_X(t)$
Binomial b(x; n, p)         $[p e^t + (1 - p)]^n$
Poissonian p(x; μ)          $\exp[\mu (e^t - 1)]$
Exponential e(x; λ)         $\lambda / (\lambda - t)$, $t < \lambda$
Gaussian g(x; μ, σ)         $\exp(\mu t + \sigma^2 t^2 / 2)$
Chi-square χ²(ν)            $(1 - 2t)^{-\nu/2}$


By solving the integrals or sums of Eq. (C.1), one can easily obtain the generating 
functions of the most important probability distributions. Some of these are shown 
in Table C.1. 

The importance of the moment-generating function is due to the following
property: if $Y = X_1 + X_2$ and the random variables $X_1$ and $X_2$ are independent,
then from Eq. (4.9) it follows that the Mgf of $Y$:

$$
G_Y(t) = \langle e^{t(X_1 + X_2)} \rangle = \langle e^{tX_1} \rangle \langle e^{tX_2} \rangle = G_{X_1}(t) \, G_{X_2}(t), \tag{C.4}
$$

is the product of the Mgfs of $X_1$ and $X_2$. Obviously this property also holds for a
sum of $k$ independent variables, so that the Mgf $G_T(t)$ of the Erlang density (3.57),
which is the density of the sum $T = T_1 + \dots + T_k$ of $k$ exponential times, results in:

$$
G_T(t) = \prod_{i=1}^{k} G_{T_i}(t) = \left( \frac{\lambda}{\lambda - t} \right)^k, \tag{C.5}
$$

where the exponential Mgf of Table C.1 has been used. This property can also
be checked directly by solving the integral of Eq. (C.1) for the Erlang distribution (3.57).
If we denote the time variable by $x$ instead of $t$ to avoid notation conflicts
with the Mgf variable, we can write:

$$
G_T(t) = \frac{\lambda^k}{(k-1)!} \int_0^\infty x^{k-1} \exp[-(\lambda - t)x] \, dx = \left( \frac{\lambda}{\lambda - t} \right)^k, \tag{C.6}
$$

where the standard result has been used:

$$
\int_0^\infty x^k \exp(-ax) \, dx = a^{-(k+1)} \, k!, \quad a > 0, \ k \text{ integer}.
$$
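
The equality between the integral form (C.6) and the product form (C.5) can also be checked numerically. The sketch below is only an illustration: the values of `lambda`, `k` and `tval` are arbitrary choices (with `tval < lambda`), and the Erlang density is taken from the base R function `dgamma` with integer shape.

```r
lambda <- 2; k <- 3; tval <- 0.5   # illustrative values, tval < lambda

# <exp(tX)> for an Erlang density; the integrand is built on the log scale
# so that exp(tval * x) cannot overflow for large x
integrand <- function(x) {
  exp(tval * x + dgamma(x, shape = k, rate = lambda, log = TRUE))
}
numeric_mgf <- integrate(integrand, lower = 0, upper = Inf)$value

# Closed form of Eq. (C.5)
closed_form <- (lambda / (lambda - tval))^k

c(numeric = numeric_mgf, closed_form = closed_form)
```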


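The Central Limit Theorem 3.1 for a sum of $n$ independent variables $X_i$, all having
the same density with mean $\mu$ and standard deviation $\sigma$, can be easily proved
with the use of the Mgf. Indeed, if:

$$
Y = \frac{\sum_i X_i / n - \mu}{\sigma / \sqrt{n}} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{X_i - \mu}{\sigma} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Z_i, \tag{C.7}
$$

from Eq. (C.4) one obtains:

$$
G_Y(t) = \left[ G_Z\!\left( \frac{t}{\sqrt{n}} \right) \right]^n = \left[ 1 + \frac{t}{\sqrt{n}} \langle Z \rangle + \frac{t^2}{2n} \langle Z^2 \rangle + \dots \right]^n. \tag{C.8}
$$

Since $\langle Z \rangle = 0$ and $\langle Z^2 \rangle = 1$, passing to the limit the result:

$$
\lim_{n \to \infty} G_Y(t) = \lim_{n \to \infty} \left( 1 + \frac{t^2}{2n} \right)^n = e^{t^2/2} \tag{C.9}
$$

is obtained, showing that, for large $n$, the Mgf of $Y$ tends to that of a standard
Gaussian density (see Table C.1).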
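
The limit (C.9) can be watched numerically for a specific density. The following sketch assumes, purely as an example, that $X$ is exponential with unit rate, so that the standardized variable $Z = X - 1$ has $\ln G_Z(s) = -s - \ln(1 - s)$ for $s < 1$. The function names are hypothetical.

```r
# log-Mgf of Z = X - 1 with X ~ Exp(1) (an illustrative choice of density)
log_mgf_z <- function(s) -s - log1p(-s)

# Mgf of Y = sum(Z_i) / sqrt(n), following Eq. (C.8)
mgf_y <- function(tval, n) exp(n * log_mgf_z(tval / sqrt(n)))

tval <- 1
n <- c(10, 100, 1e4, 1e6)
data.frame(n = n, G_Y = mgf_y(tval, n), gaussian_limit = exp(tval^2 / 2))
```

The column `G_Y` approaches $e^{t^2/2}$ as `n` grows, with a residual difference of order $t^3/(3\sqrt{n})$ for this particular density.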


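The additivity theorem (3.4) for $\chi^2$ variables can also be easily proved with the Mgf.
If:

$$
Y = Q(\nu_1) + Q(\nu_2) \tag{C.10}
$$

and $Q(\nu_1) \sim \chi^2(\nu_1)$, $Q(\nu_2) \sim \chi^2(\nu_2)$ are independent, then $G_Y(t) = G_{\nu_1}(t) \, G_{\nu_2}(t)$,
so that

$$
G_Y(t) = (1 - 2t)^{-(\nu_1 + \nu_2)/2} \implies Y \sim \chi^2(\nu_1 + \nu_2) \tag{C.11}
$$

from Table C.1. If instead $Q(\nu_1) \sim \chi^2(\nu_1)$ and $Q(\nu_2) \sim \chi^2(\nu_2)$, it is easy to
show, by inverting Eq. (C.4), that, when $Y$ and $Q(\nu_2)$ are independent, the following
property holds:

$$
Q(\nu_1) = Y + Q(\nu_2) \implies Y \sim \chi^2(\nu_1 - \nu_2). \tag{C.12}
$$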
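
The additivity property (C.11) can be cross-checked without the Mgf, by convolving the two densities numerically and comparing with `dchisq`. This is only a sketch: the degrees of freedom and the evaluation points are arbitrary, and `conv_density` is a hypothetical helper name.

```r
nu1 <- 3; nu2 <- 4   # illustrative degrees of freedom

# Density of Y = Q(nu1) + Q(nu2) by direct convolution of the two densities
conv_density <- function(y) {
  integrate(function(x) dchisq(x, nu1) * dchisq(y - x, nu2),
            lower = 0, upper = y, rel.tol = 1e-8)$value
}

y <- c(1, 5, 10)
conv <- sapply(y, conv_density)
cbind(y = y, convolution = conv, chisq_sum = dchisq(y, nu1 + nu2))
```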


Appendix D 
Solutions of Problems 


Chapter 1 


1.1. If $C$ is the event of winning the car by changing door, $A$ is the event “the car
is behind the first chosen door”, and $\bar{C}$ and $\bar{A}$ are the complementary events,
from the partition theorem one has:
$P(C) = P(C|A)P(A) + P(C|\bar{A})P(\bar{A}) = 0 \cdot 1/3 + 1 \cdot 2/3 = 2/3$,
$P(\bar{C}) = P(\bar{C}|A)P(A) + P(\bar{C}|\bar{A})P(\bar{A}) = 1 \cdot 1/3 + 0 \cdot 2/3 = 1/3$.
It is better to change the door.

1.2. From Eq. (1.31) it results that the number of possible games is
$52!/(13!)^4 \simeq 5.36 \cdot 10^{28}$. Since the number of games played is
$\simeq 5.475 \cdot 10^{14}$, $P \simeq 1.02 \cdot 10^{-14}$.
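
The count of Problem 1.2 is easy to reproduce. The sketch below assumes that $52!/(13!)^4$ is written as a product of binomial coefficients (the four hands are dealt one after the other), which avoids very large factorials. The number of games played is the one quoted in the solution.

```r
# Number of distinct deals: 52 cards split into four ordered hands of 13,
# equal to 52!/(13!)^4
n_deals <- choose(52, 13) * choose(39, 13) * choose(26, 13)

# Number of games played, as quoted in the solution
n_played <- 5.475e14

c(deals = n_deals, probability = n_played / n_deals)
```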


1.3. $P = p_1[1 - (1 - p_2)(1 - p_3)] = 0.776$.

1.4. (a) Elements 1 and 2 are in parallel, and they are in series with the parallel
combination of elements 3 and 4. (b) $P = [1 - (1 - p)^2]^2 = 0.922$.

1.5. $P = 1 - (5/6)^3 = 0.421$.

1.6. (a) $P = 1 - 7/10 = 0.30$; (b) $P = 1 - 120/720 = 0.83$.

1.7. We outline the solution, which is not simple: one can imagine the arrival of each
of the two friends as a point $x$ or $y$ located at random within a 60 min long interval.
$X$ and $Y$ do not meet if the event $\{X \notin [y, y + 12], \ Y \notin [x, x + 10]\}$ occurs.
So $P = 1 - (48/60)(50/60) = 1 - 2/3 = 1/3$.

1.8. From the total probabilities formula: $P(T) = 0.14$. From Bayes' theorem:
$P(B|T) = 0.678$. The graphic method of Fig. 1.6 can also be used.

1.9. $P\{X \le Y\} = 1/2$. Indeed, it is reasonable to assume the probability as the
ratio of the area above the diagonal to the total area of a unit square.

1.10. With obvious notation:
$P(A)P(D|A) + P(B)P(D|B) + P(C)P(D|C) = 0.165 = 16.5\%$.

1.11. $P\{C \ge R \mid C \ge 150, R \le 155\} \, P\{C \ge 150, R \le 155\} = (1/2)(5/15)(5/20) = 0.0417$.




1.12. One obtains $P(H_1) \simeq P(H_2) \simeq 0$, $P(H_3) = 0.03$, $P(H_4) = 0.22$, $P(H_5) = 0.47$,
$P(H_6) = 0.28$. Compare the results with those of Table 1.2.

1.13. From Eq. (1.41), if $V_n$ is the event where the friend wins $n$ consecutive games
($V_1 = V$), defining the hypotheses $B$ = cheat and $O$ = honest, and assuming
$P(B) = P(O) = 0.5$, $P(V|O) = 0.5$, $P(V|B) = 1$, the following
recursive formula is obtained:
$P(B|V_n) = P(B|V_{n-1}) / [P(B|V_{n-1}) + 0.5(1 - P(B|V_{n-1}))]$.
One obtains $P(B|V_5) = 0.97$, $P(B|V_{10}) = 0.999$,
$P(B|V_{15}) = 0.99997$. See how the results change as $P(B)$ and $P(O)$ change.

1.14. (1) True, (2) false, (3) true, (4) false.

1.15. $P(21|20)P(20) = 0.2 \cdot 0.4 = 0.08$, $P(21|21)P(21) = 0.6 \cdot 0.4 = 0.24$,
$P(21|22)P(22) = 0.2 \cdot 0.1 = 0.02$, from which $P(20|21) = 0.24$,
$P(21|21) = 0.70$, $P(22|21) = 0.06$ from the Bayes formula.

1.16. We need to find the probability $P(H|V)$ that a winner is honest, knowing that
the probability that an honest man will be a winner is P(V|H) = 107°.
The two probabilities are connected by Bayes' theorem:
$P(H|V) = P(V|H)P(H) / [P(V|H)P(H) + P(V|D)P(D)]$. The probability of winning
dishonestly is $P(V|D) = 1$; however, if we assume that there are no
dishonest players cheating, $P(D) = 0$, and it turns out that, as it should be,
$P(H|V) = 1$.

1.17. We must find the probability $P(C|DNA)$ that a person who tests positive is guilty,
in a test where the probability of a positive result for an innocent is $P(DNA|I) = 10^{-4}$.
Assuming that in the tested population the probability that a person is
guilty is $P(C) = 0.5 \cdot 10^{-4}$ and that the probability of being innocent is instead
$P(I) = 1 - 0.5 \cdot 10^{-4} \simeq 1$, from Bayes' theorem one obtains
$P(C|DNA) = P(DNA|C)P(C) / [P(DNA|C)P(C) + P(DNA|I)P(I)] = 1 \cdot 0.5 \cdot 10^{-4} / [1 \cdot 0.5 \cdot 10^{-4} + 10^{-4} \cdot 1] = 0.33$.

1.18. The probability to extract 2 identical cards (apart from the suits) from a deck
of 52 cards is given by the hypergeometric law (1.33) with $a = 4$, $b = 52$,
$k = 2$, $n = 2$, $n - k = 0$:

$$
\frac{4!}{2! \, 2!} \, \frac{2! \, 50!}{52!} = \frac{4!}{2!} \, \frac{1}{52 \cdot 51} = \frac{12}{52 \cdot 51}.
$$

The probability to get a particular face throwing a die is 1/6. The probability
of the event is then the product of the probabilities:

$$
\frac{12}{52 \cdot 51} \, \frac{1}{6} = \frac{2}{52 \cdot 51} = \frac{1}{1326} \simeq 7.5 \cdot 10^{-4}.
$$


Chapter 2 


2.1. $b(2; 3, 1/6) = 5/72$.

2.2. The number of favourable cases is $10!/(3! \, 7!) = 120$ and that of the possible
ones is $2^{10} = 1024$. The probability is $120/1024 = 0.117$, according to the
binomial density.
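
Both results can be verified with the base R function `dbinom`. The sketch assumes, from the numbers $10!/(3!\,7!)$ and $2^{10}$ appearing in the solution, that Problem 2.2 concerns 3 successes in 10 trials with probability 1/2; the text of the problem is not reproduced here.

```r
# Problem 2.1: binomial probability b(2; 3, 1/6)
p21 <- dbinom(2, size = 3, prob = 1/6)

# Problem 2.2: favourable over possible cases, compared with b(3; 10, 1/2)
# (assumed reading of the problem, see the note above)
p22_counting <- choose(10, 3) / 2^10
p22_binomial <- dbinom(3, size = 10, prob = 0.5)

c(p21 = p21, p22_counting = p22_counting, p22_binomial = p22_binomial)
```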




2.3. From the binomial density: $1 - b(0; 10, 0.1) - b(1; 10, 0.1) - b(2; 10, 0.1) = 0.0702$,
that is about 7%.

2.4. $\langle X + Y \rangle = \langle X \rangle + \langle Y \rangle = 100 \cdot 18/37 + 0 \cdot 19/37 + 360 \cdot 1/37 + 0 \cdot 36/37 = 58.4$. If
you imagine a very large set (tending to $\infty$) of $N$ players, each with a starting
capital of 60 euros, after a bet the average capital of this set will have dropped
to 58.4 euros, with an average profit in favour of the dealer of 1.6 euros per
player.

2.5. $\langle X \rangle = 6$, $\sigma[X] = 2.83$, $\langle Y \rangle = 13$, $\sigma[Y] = 5.66$.

2.6. The solution is given by the hypergeometric law (1.33). The R routine
dhyper can be used with the call dhyper(k,7,3,3) with k=0,1,2,3.
The results of the following table are obtained:

r       0        1          2          3
P(r)    6/720    126/720    378/720    210/720

This is an example of a hypergeometric distribution. Note that the density is
normalized. $\langle R \rangle = 0.9$, $\sigma[R] = 0.7$.

2.7. $p(x) = 2x$, $F(x) = x^2$, $\langle X \rangle = 2/3$, $\sigma[X] = 1/(3\sqrt{2})$.

2.8. $F(x_{0.25}) = x_{0.25}^2 = 0.25$, from which $x_{0.25} = 0.5$. The probability to observe
values $\le 0.5$ is 25%.

2.9. The journey takes 6 h, 4 h on the way there and 2 h on the way back. If the velocity
$V$ is considered as a two-valued variable, one has $\langle V \rangle = 25 \cdot 4/6 + 50 \cdot 2/6 = 200/6 = 33.3$ km/h.
Note that the result $\langle V \rangle = (50 + 25)/2 = 37.5$ km/h is wrong.

2.10. The trials are 1000 (instead of 100), and the spectrum values are two: $x = 0, 1$
(heads or tails, instead of the 11 values $x = 0, 1, \dots, 10$). From Table 2.2, by
summing the products of the first column by the second one
($0 \cdot 0 + 1 \cdot 0 + 2 \cdot 5 + 3 \cdot 13 + \dots$), we obtain 521 heads and, obviously, 479 tails, to be
compared with an expected value of $1000 \cdot b(1, 0.5, 0) = 1000 \cdot b(1, 0.5, 1) = 1000 \cdot 0.5 = 500$.

2.11. 100 sorted binomial variates are generated with the calls:

x <- sort(rbinom(100, size = 20, prob = 0.3))
y <- sort(rbinom(100, size = 20, prob = 0.3))

and the plot is generated with the call qqplot(x,y), in which the irregular
structure is due to the presence of repeated discrete values. The comparison
with the theoretical distribution requires first the construction of a vector x1
with cumulative probabilities associated with the discrete values x. This can
be achieved with the calls:

x1 <- numeric(length(x))
for(k in 1:length(x)) x1[k] = length(which(x<=x[k]))/length(x)

The expected quantiles are obtained with qth<-qbinom(x1,size=20,prob=0.3).
With the command qth<-qth[1:(length(qth)-1)], the last value
(= 1 by construction) is excluded. Finally, the plot is obtained with
qqplot(qth,x). It is useful, using the plot, to examine the effect of
varying prob in the calculation of qth.


Chapter 3 


3.1. $\sum_x P\{Y > X \mid X = x\} P\{X = x\} \to \int_0^1 (1 - x) \, dx = 1/2$.

3.2. The density of $X$ is binomial (Gaussian), so that $\langle X \rangle = 500 \cdot 0.5 = 250$,
$\sigma[X] = \sqrt{500 \cdot 0.25} = 11.18$. The 3-sigma law is valid.

3.3. After $n$ steps, the path $X$ has a spectrum $X = n, n - 2, \dots, -n$ of discrete
values which differ from each other by two units. The variable $Y = (X + n)/2$
is binomial and assumes integer values within the interval $[0, n]$. Therefore,
one has $\langle Y \rangle = (\langle X \rangle + n)/2 = np = n/2$ and $\mathrm{Var}[Y] = \mathrm{Var}[X]/4 = np(1 - p) = 125$.
Finally, it results: $\langle X \rangle = 0$ and $\sigma[X] = 2\sqrt{125} = 22.36$. The
3-sigma law is valid.

3.4. (a) 0.01576; (b) 0.016.

3.5. $\mu = 200 \cdot 0.2 = 40$, $\sigma = \sqrt{200 \cdot 0.2 \cdot 0.8} = 5.66$. Using the Gaussian
approximation and Table E.1, one has: $P\{-1.77 \le t \le 1.77\} = 0.923$.

3.6.
$$
p_Y(y) = \frac{1}{\sqrt{2\pi} \, \sigma y} \exp\left[ -\frac{(\ln y - \mu)^2}{2\sigma^2} \right], \quad y > 0.
$$

It is enough to note that $\ln Y$ is Gaussian and to remember that $d \ln y = dy/y$.
This p.d.f. is known as log-normal distribution.


3.7. FWHM $= 2.355 \, \sigma$.

3.8. (a) $0.0695 \simeq 7\%$; (b) $0.7165 \simeq 70\%$, independently of the previous 8 months.

3.9. $\chi^2/\nu = 7/10 = 0.7$, corresponding to a probability (significance level) $\simeq 28\%$,
given by Table E.3.

3.10. $(8500 - 8100)/\sqrt{8500} = 4.34$. The decrease is significant.

3.11. $1 - \exp(-100 \cdot 0.001) = 0.0952 \simeq 9.5\%$.

3.12. From the Poisson density: $P\{X = 0\} = \exp(-10/2) = 0.0067$. The probability
to be wrong is then $\simeq 0.7\%$.

3.13. If $\{X_1 + X_2 = n\}$ is the sum of counts recorded in disjoint time intervals,
from Newton's binomial formula, it is possible to prove that
$P\{X_1 + X_2 = n\} = \sum_k [P\{X_1 = k\} P\{X_2 = n - k\}]$ when the probabilities are calculated
with the Poisson density (3.47).

3.14. $[-0.675, 0.675]$, by linear interpolation from Table E.1.

3.15. From the interpolation of Table E.1, one obtains $P\{t > 0.806\} = 0.21$ and
$P\{t > 1.55\} = 0.06$. From the values $(4.41 - \mu)/\sigma = 0.806$ and
$(6.66 - \mu)/\sigma = 1.55$, one obtains $\mu = 2$ and $\sigma = 3$. A numerical check can be done
with the R function pnorm.
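
The numerical check mentioned in Problem 3.15 can be sketched as follows, replacing the table interpolation with the base R function `qnorm`. The variable names are chosen here for illustration only.

```r
# Values and upper-tail probabilities quoted in the solution
x <- c(4.41, 6.66)
tail_prob <- c(0.21, 0.06)

# Standard Gaussian quantiles such that P{T > tq} = tail_prob
tq <- qnorm(1 - tail_prob)

# Solve (x - mu)/sigma = tq for the two unknowns
sigma <- diff(x) / diff(tq)
mu <- x[1] - tq[1] * sigma
c(mu = mu, sigma = sigma)

# Cross-check with pnorm: the two tail probabilities are recovered
1 - pnorm(x, mean = mu, sd = sigma)
```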




3.16. Using the hour as time unit, for $t = 1$ hour, $\lambda t = 100/120 = 0.8333$, and
hence, from the Poisson p.d.f.: $p(4) = (\lambda t)^4 \exp(-\lambda t)/4! = 0.00873$. The
waiting time is then given by $1/0.00873 = 114.5$ hours, corresponding to $4.77 \simeq 5$ days.

3.17. The cumulative function is $F(x) = \int_1^x (2x - 2) \, dx = x^2 - 2x + 1$. From
Theorem 3.5, $X^2 - 2X + 1 = R$ where $R \sim U(0, 1)$. The second degree
equation, for $1 \le x \le 2$, gives the solution $X = 1 + \sqrt{R}$. $X$ is generated
according to the assigned density with the command $X = 1 + \sqrt{\text{Random}}$,
where $0 \le \text{Random} \le 1$ has a uniform distribution.

3.18. (a) 4.6% from Table E.1; (b) since $\mathrm{Var}[\sum_i x_i] = 100 \, \sigma^2[x] = 250$, one has
$\sigma[\sum_i x_i] = 15.8$ and $t = |1050 - 1000|/15.8 = 3.16$, corresponding to a
probability $\simeq 8 \cdot 10^{-4}$; (c) $< 25\%$, from Chebyshev's inequality.

3.19. $\langle X \rangle = 1.5$, $\langle X^2 \rangle = 2.9$, $\langle 30X \rangle = 45$, $\sigma[30X] = 24$; $(80 - 45)/24 = 1.46$,
corresponding, from Table E.1, to a probability of 7.2%.

3.20. The number of defects is $np \pm \sqrt{np(1 - p)} = 10.0 \pm 3.1$, so that
$t = |16 - 10|/3.1 = 1.93$. From Table E.1, this value gives a right-tail
probability of 2.7%, which is the requested percentage.
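
The generation rule $X = 1 + \sqrt{R}$ of Problem 3.17 can be tried directly. The sketch below uses `runif` in place of the generic Random call; the seed and the sample size are arbitrary choices. The sample is compared with the exact mean $\int_1^2 x(2x - 2)\,dx = 5/3$ and with the cumulative function $F(x) = (x - 1)^2$.

```r
set.seed(1)                  # arbitrary seed, for reproducibility only
n <- 1e5
x <- 1 + sqrt(runif(n))      # X = 1 + sqrt(R), R ~ U(0, 1)

# Sample mean against the exact mean of p(x) = 2x - 2 on [1, 2]
c(sample_mean = mean(x), exact_mean = 5/3)

# Empirical against exact cumulative function at a few points
q <- c(1.25, 1.5, 1.75)
rbind(empirical = ecdf(x)(q), exact = (q - 1)^2)
```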


Chapter 4 


4.1. $p(x, y) \, dx \, dy = P\{x \le X \le x + dx, \ y \le Y \le y + dy\} = dx \, dy/(ab)$ for
$0 \le x \le a$, $0 \le y \le b$; $p(x, y) = 0$ otherwise.

4.2. (a) $(X_1 - \mu_1)^2/\sigma_1^2 = 2.71$; (b) $\sum_{i=1}^{2} (X_i - \mu_i)^2/\sigma_i^2 = 4.61$;
(c) $\sum_{i=1}^{3} (X_i - \mu_i)^2/\sigma_i^2 = 6.25$. The values are obtained from Table E.4.

4.3. From Theorem 4.1 and Eqs. (4.14) and (4.16), it results that the inequality is
valid.

4.4. The tangents are traced as in Fig. 4.5, and the intersection points $(x_2, y_2)$ and
$(x_1, y_1)$ are found. From Eq. (4.55) one then has $(\sigma_x/\sigma_y)(y_2 - y_1)/(x_2 - x_1) = \rho$.

4.5. $\mathrm{Cov}[Z, U] = \langle ZU \rangle - \langle Z \rangle \langle U \rangle = \langle (aX + b)(cY + d) \rangle - (a \langle X \rangle + b)(c \langle Y \rangle + d) = ac(\langle XY \rangle - \langle X \rangle \langle Y \rangle)$.
Since $\sigma[Z] = a \sigma[X]$ and $\sigma[U] = c \sigma[Y]$, one has
$\rho[Z, U] = (\langle XY \rangle - \langle X \rangle \langle Y \rangle)/(\sigma[X] \sigma[Y]) = \rho[X, Y]$.

4.6. It is enough to use Eq. (4.79) with different limits: $P\{X > 180, Y > 170\} = [0.5 - E(0.625)] \cdot [0.5 - E(0.833)] = 0.266 \cdot 0.202 = 0.0537 \simeq 5.4\%$. The
values of the function $E(\dots)$ can be obtained from Table E.1.

4.7. It would be necessary to integrate Eq. (4.40) over the region $X \in [180, +\infty)$,
$Y \in [170, +\infty)$. Notice also that the concentration ellipse does not give the
correct solution in this case. See Problem 8.11 to evaluate a solution using
Monte Carlo methods.

4.8. The density $p_{XY}$ is given by the first table below, while its
marginal densities are in the second table. From these tables it results that the
variable $X$, corresponding to the first die, is uniformly distributed, while $Y$,
which is correlated to $X$, has a different density. From Eqs. (4.17) and (4.18)
it is possible to calculate means and standard deviations: $\mu_x = 3.5$, $\mu_y = 3.25$,
$\sigma_x = 1.708$, $\sigma_y = 1.726$. Finally, the covariance can be calculated with
Eq. (4.21) and from the tabulated values of $p_{XY}$. The value $\sigma_{xy} = 1.458$ is thus
obtained.

(x, y) | p_XY | (x, y) | p_XY | (x, y) | p_XY
(1, 1) | 4/36 | (3, 1) | 1/36 | (5, 1) | 1/36
(1, 3) | 1/36 | (3, 3) | 4/36 | (5, 3) | 1/36
(1, 5) | 1/36 | (3, 5) | 1/36 | (5, 5) | 4/36
(2, 1) | 1/36 | (4, 1) | 1/36 | (6, 1) | 1/36
(2, 2) | 3/36 | (4, 3) | 1/36 | (6, 3) | 1/36
(2, 3) | 1/36 | (4, 4) | 3/36 | (6, 5) | 1/36
(2, 5) | 1/36 | (4, 5) | 1/36 | (6, 6) | 3/36

x, y | 1   | 2    | 3   | 4    | 5   | 6
p_X  | 1/6 | 1/6  | 1/6 | 1/6  | 1/6 | 1/6
p_Y  | 1/4 | 1/12 | 1/4 | 1/12 | 1/4 | 1/12
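
The means and the covariance of Problem 4.8 can be recomputed from the tabulated joint density. The sketch below enters the table as (x, y, 36 p) triplets; the object names are illustrative only.

```r
# Joint density p_XY of Problem 4.8 as (x, y, 36 * p) triplets
tab <- rbind(
  c(1, 1, 4), c(1, 3, 1), c(1, 5, 1),
  c(2, 1, 1), c(2, 2, 3), c(2, 3, 1), c(2, 5, 1),
  c(3, 1, 1), c(3, 3, 4), c(3, 5, 1),
  c(4, 1, 1), c(4, 3, 1), c(4, 4, 3), c(4, 5, 1),
  c(5, 1, 1), c(5, 3, 1), c(5, 5, 4),
  c(6, 1, 1), c(6, 3, 1), c(6, 5, 1), c(6, 6, 3)
)
x <- tab[, 1]; y <- tab[, 2]; p <- tab[, 3] / 36

total <- sum(p)                         # normalization of the density
mu_x <- sum(x * p)
mu_y <- sum(y * p)
cov_xy <- sum(x * y * p) - mu_x * mu_y  # <XY> - <X><Y>
c(total = total, mu_x = mu_x, mu_y = mu_y, cov_xy = cov_xy)
```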


Chapter 5 


5.1. Since, for a uniform variable $X$, one can write
$P\{Y = -2 \ln X > y\} = P\{X < e^{-y/2}\} = e^{-y/2}$, the cumulative function of $Y$ is $F(y) = 1 - e^{-y/2}$.
The derivative of the cumulative provides the requested density:
$p(y) = dF/dy = e^{-y/2}/2$, $0 \le y < \infty$. Equation (3.67) shows that a $\chi^2$ distribution
with 2 degrees of freedom has been obtained.

5.2. For $x \ge 0$ there is one root $x = \sqrt{z}$. Then, one obtains
$p_Z(z) = 2(1 - \sqrt{z})/(2\sqrt{z}) = 1/\sqrt{z} - 1$, $0 < z \le 1$.

5.3. Defining the auxiliary variable $W = Y$, one has $X = ZW$ and $Y = W$; the
derivatives $\partial f_1^{-1}/\partial z = w$, $\partial f_1^{-1}/\partial w = z$, $\partial f_2^{-1}/\partial z = 0$, $\partial f_2^{-1}/\partial w = 1$
allow the calculation of the Jacobian $|J| = |w|$. The result is then
$p_Z(z) = \int |w| \, p_{XY}(zw, w) \, dw$.

5.4. From Eq. (5.32) one has $p_Z(z) = \int p_Y(z - x) \, p_X(x) \, dx$, with $x \ge 0$, $(z - x) \ge 0$,
so that $p_Z(z) = \int_0^z p_Y(z - x) \, p_X(x) \, dx = z \exp[-z]$.

5.5. The procedure of the previous problem must be extended by induction,
and the Erlang or gamma distribution $\Gamma(n, \lambda)$ is found, which is
$e_n(t) = \lambda (\lambda t)^{n-1} e^{-\lambda t}/(n - 1)!$.

5.6. The conditions $0 \le Z \le 1$, $W \ge 0$ hold, and the inverse functions are $X = ZW$
and $Y = W(1 - Z)$. The Jacobian is $|J| = |W| = W$, and the requested
density is $p_{ZW}(z, w) \, dz \, dw = w \exp[-w] \, dz \, dw$. From Theorem 4.1, $Z$ and
$W$ are independent, and $Z$ is a uniform random variable.

5.7. From Eq. (5.27), taking into account that $z = f(x, y) = xy$, $f^{-1} = x = z/y$,
$\partial f^{-1}/\partial z = 1/y$ and that in the ratio $x = z/y$ the condition $y \ge z$ must
hold to assure the limits $0 \le x \le 1$, one obtains $p_Z(z) = \int_z^1 (1/y) \, dy = -\ln z$,
for $0 < z \le 1$. From the integral
$\int z^n \ln z \, dz = \ln z \; z^{n+1}/(n + 1) - z^{n+1}/(n + 1)^2$, one easily obtains
$\langle Z \rangle = 1/4 = 0.25$ and $\mathrm{Var}[Z] = \langle Z^2 \rangle - \langle Z \rangle^2 = (1/9) - (1/16) \simeq 0.049$.
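
The moments quoted in Problem 5.7 can be confirmed by numerical integration of the density $p_Z(z) = -\ln z$. This is a minimal sketch using the base R function `integrate`; the function name `p_z` is chosen here for illustration.

```r
# Density of Z = XY for independent uniform X and Y (Problem 5.7)
p_z <- function(z) -log(z)

norm_z <- integrate(p_z, 0, 1)$value                        # normalization
mean_z <- integrate(function(z) z * p_z(z), 0, 1)$value     # <Z>
m2_z   <- integrate(function(z) z^2 * p_z(z), 0, 1)$value   # <Z^2>
c(norm = norm_z, mean = mean_z, variance = m2_z - mean_z^2)
```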




5.8. If $T_1$ and $T_2$ are the failure times, the system stops working at a time $T = T_1 + T_2$.
The p.d.f. is then given by the convolution of two exponential densities.
One then has $\lambda^2 \int_0^t \exp[-\lambda(t - u)] \exp(-\lambda u) \, du = \lambda^2 t \exp(-\lambda t)$, which is
the gamma density (3.57). The mean lifetime of the system is $\langle T \rangle = 2/\lambda$.
Compare this result with that of Exercise 4.1.

5.9. Case (a): $\langle Z_1 \rangle = 0$, $\langle Z_2 \rangle = 0$, $\mathrm{Var}[Z_1] = 13$ and, from Eq. (5.69),
$\mathrm{Var}[Z_2] = 1$. Notice that in the linear approximation one should have, from
Eq. (5.66), $\mathrm{Var}[Z_2] = 0$. Using the method of Eq. (5.84), one easily obtains
$\mathrm{Cov}[Z_1, Z_2] = 0$. The covariance and the correlation coefficient are null,
even if a relation between the variables exists. Case (b): $\langle Z_1 \rangle = 5$, $\langle Z_2 \rangle = 1$,
$\mathrm{Var}[Z_1] = 13$, $\mathrm{Var}[Z_2] = 3$, $\mathrm{Cov}[Z_1, Z_2] = 5$, $\rho[Z_1, Z_2] = 0.8$.

5.10. Since $X$ and $Y$ are Gaussian, from Exercise 5.3 the same holds
for $Y - X$. From Eq. (5.66) it results $\sigma[Y - X] = \sqrt{0.020^2 + 0.020^2} = 0.0283$.
Using the standard variable, we obtain the values
$t_1 = (0.050 - 0.100)/0.0283 = -1.768$, $t_2 = (0.150 - 0.100)/0.0283 = 1.768$,
corresponding to an interval of probability $P\{0.050 \le (Y - X) \le 0.150\} = 2 \cdot 0.4614 = 0.923$,
by interpolating from Table E.1. The requested percentage
is then $1 - 0.923 = 0.077 = 7.7\%$.


5.11. From Exercise 5.4, the density of the number of vehicles is Poissonian with
mean $\mu + \lambda$. From the binomial density, one then obtains
$P(k) = \frac{n!}{k!(n - k)!} \left[ \mu/(\mu + \lambda) \right]^k \left[ \lambda/(\mu + \lambda) \right]^{n-k}$.

5.12. X<-rnorm(1000), Y<-rnorm(1000), Z1<-X+3*Y, Z2<-5*X+Y,
cov(Z1,Z2)/sqrt(var(Z1)*var(Z2)).

Chapter 6 

6.1. (a) 13 < R < 120, (b) 26 < R < 133, (c) 17 < R < 116.

6.2. The solution is the value of $\mu$ satisfying the equation $\mu - 1.28\sqrt{\mu} - 20 = 0$
with $\mu > 20$, so that $\mu = 26.6$.

6.3. The expected value of the root mean square (standard deviation) coincides
with the true value: $\langle S \rangle = 12$ kg. Instead, one has $\sigma[S] \simeq \sigma/\sqrt{2N} = 0.6$ kg.

6.4. From the equation $Np = 215 \pm \sqrt{215(1 - 0.2)} = 215 \pm 13$, it follows
$N = 215/0.2 \pm 1.65 \cdot 13/0.2 = 1075 \pm 108$, CL = 90%.

6.5. The result follows from Theorem 6.1, since the sample variance $S^2$ is a
function of the deviation, and from Theorem 4.3. It can also be directly shown
that $\mathrm{Cov}[M, (X_i - M)] = \langle X^2 \rangle/N - \langle M^2 \rangle = \sigma^2/N - \sigma^2/N = 0$, since in a
random sample the extractions are independent: $\langle X_i X_j \rangle = 0$ if $i \neq j$.

6.6. (a) $9.8 \pm 0.2$; (b) $(x - 245)/5 = 1.645$, from which $x = 253.2$, corresponding
to an upper limit of 10.13 for the mean. One can also write $(\mu - 9.8)/0.2 \simeq 1.645$,
from which $\mu < 10.13$, CL = 95%.

6.7. $\rho \in [-0.013, 0.305]$. From the approximate formula (6.122), one obtains
$\rho \in [-0.011, 0.311]$.

6.8. From error propagation and from the formula $\nu = (x - b)/\varepsilon$, with
$b = 5.30 \pm 0.23$ counts/s, $\varepsilon = 0.90 \pm 0.10$, $x = 16.7 \pm 1.3$ counts/s, one obtains
$\nu = 12.7 \pm 2.0$ counts/s.
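
The propagation of Problem 6.8 can be written out explicitly. The sketch assumes that $x$, $b$ and $\varepsilon$ are independent and uses the first-order formula, with partial derivatives $1/\varepsilon$, $-1/\varepsilon$ and $-(x - b)/\varepsilon^2$; the variable names are illustrative.

```r
b <- 5.30; sb <- 0.23    # background, counts/s
e <- 0.90; se <- 0.10    # efficiency
x <- 16.7; sx <- 1.3     # measured rate, counts/s

nu <- (x - b) / e

# First-order error propagation for independent x, b and e (assumption)
s_nu <- sqrt((sx / e)^2 + (sb / e)^2 + (nu * se / e)^2)
c(nu = nu, sigma = s_nu)
```

The two dominant contributions, from the measured rate and from the efficiency, are of comparable size.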


6.9. The exponential time distribution has $\mu = \sigma = \tau$. The mean of the sum of the
times is $N\tau$, and the standard deviation is $\sigma = \sqrt{N} \tau$. From the inequality
$(1000 - 100N)/(\sqrt{N} \, 100) < 1.64$ follows the second degree equation
$N^2 - 22.7N + 100 = 0$, whose solution is $N = 11.6 \simeq 12$.

6.10. The upper limit with CL = 95% is given by the equation $25 + 1.65\sqrt{\mu} = \mu$,
from which $\mu = 34.7 \simeq 35$. The limit of the signal is then $35 - 10 = 25$ counts/s.
Note: it is important to subtract the background after the
calculations.

6.11. The quantile of the reduced $\chi^2$ distribution with 24 degrees of freedom at the
0.95 level is $\chi^2_R(24) = 0.58$. We must
then set $s^2/\sigma^2 < 0.58$, from which $\sigma^2 > s^2/0.58 = 31.03$, $s = \sqrt{18} = 4.24$,
$\sigma > 5.56$.

6.12. One has: 30/20 = 1.5 defects/km. The error is obtained from Eq. (6.44) with
$t = 1$, or with PoissApp(30), because, if the confidence level is not
given, by default CL = 0.68. One then has: $\mu \in [24.54, 36.54] = 30^{+6.54}_{-5.46}$.
The number of defects per km is obtained dividing result and error by 20:
$\mu = 1.5^{+0.33}_{-0.27}$. In an approximate way, from Eq. (6.45), one has
$\mu = 30/20 \pm \sqrt{30}/20 = 1.5 \pm 0.27$. Important remark: finding the error as
$\sqrt{30/20} = 1.22$ is wrong. Indeed, the rule $\sigma \simeq \sqrt{x}$ holds for original Poissonian variables
only. A Poisson variable, if manipulated in any way, for example, by dividing
or multiplying by some value, is no longer Poisson distributed.

6.13. One obtains poisson.test(35,conf=0.95,alt="greater") = 25.87,
PoissApp(35,conf=0.95,alt="low") = 25.78.

6.14. Analytically, one has $\mathrm{Cov}[x, y] = 0$, because $x$ and $y$ are independent.
Instead $\mathrm{Cov}[x, y1] = \langle (x - \langle x \rangle)(y1 - \langle y1 \rangle) \rangle = \langle x(3x + y - \langle 3x + y \rangle) \rangle = \langle 3x^2 + xy - 5x \rangle = 3 \langle x^2 \rangle = 3 \langle x \rangle^2 + 3\sigma_x^2 = 3$.
Recall that, since $x$ and $y$ are independent, $\langle xy \rangle = \langle x \rangle \langle y \rangle = 0 \cdot 5 = 0$.
For the correlation coefficient, one has
$\mathrm{Cov}[x, y1]/(\sigma_x \sigma_{y1}) = 3/\sqrt{9\sigma_x^2 + 1} = 3/\sqrt{10} = 0.949$.
From the routine CorrelEst(x,y1), one has $\mathrm{Cov}[x, y1] = 2.91 \pm 0.57$
and $r = 0.941^{+0.019}_{-0.044}$.

6.15. Analytically, one has $\langle x \rangle = \langle y \rangle = 1/2$, $\sigma_x = \sigma_y = 1/\sqrt{12} = 0.289$,
$\langle y1 \rangle = 3/2 = 1.5$ and $\sigma_{y1} = \sqrt{4\sigma_x^2 + \sigma_y^2} = \sqrt{5/12} = 0.645$.
Hence, $\mathrm{Cov}[x, y1] = \langle (x - 1/2)(y1 - 3/2) \rangle = \langle (x - 1/2)(2x + y - 3/2) \rangle = \langle 2x^2 - x + xy - y/2 - 3x/2 + 3/4 \rangle = 1/6 = 0.166$,
since $\langle xy \rangle = \langle x \rangle \langle y \rangle = 1/4$, because $x$ and $y$ are independent. For the correlation
coefficient, one has $\mathrm{Cov}[x, y1]/(\sigma_x \sigma_{y1}) = 0.166/(0.289 \cdot 0.645) = 0.890$.
From the routine CorrelEst(x,y1), one has $\mathrm{Cov}[x, y1] = 0.166 \pm 0.011$
and $r = 0.90 \pm 0.02$.

6.16. Applying logarithms, the variance additivity formula (5.66) and the Wald
formula for the frequency variance for big samples (6.33), one obtains
$\mathrm{Var}[\ln(OR)] = (1 - f_1)/(f_1 N_1) + (1 - f_2)/(f_2 N_2)$, where $N_1 = 11037$
and $N_2 = 11034$. The CL = 90% interval, under the Gaussian linear
approximation, is obtained by multiplying the standard deviation by the quantile $t = 1.65$.
We then obtain the logarithmic interval
$\ln(OR) \pm 1.65 \sqrt{\mathrm{Var}[\ln(OR)]} = -0.597 \pm 0.200 = [-0.797, -0.397]$.
Returning to the original variables, one has
$OR = [\exp(-0.797), \exp(-0.397)] = [0.451, 0.672] = 0.55^{+0.12}_{-0.10}$.


Chapter 7 


7.1. To use the routine TdiffMean, one needs the two sample standard deviations,
which are $0.05 \sqrt{10} = 0.158$. With the call
TdiffMean(c(5.36,5.21),c(0.158,0.158),c(10,10),alt='two'),
a p-value of 0.048 is obtained. The same result is obtained
also under the hypothesis of equal variances, by setting var=TRUE in the
routine call. The 5% homogeneity test is (narrowly) not passed.

7.2. Assuming that differences are due to chance (null hypothesis), Eqs. (7.42)
and (7.43) provide the value $\chi^2(1) = 0.993$, corresponding to SL $\simeq 34\%$
(interpolated from Table E.3).
The R routine chisq.test(rbind(c(40,28),c(30,30)),correct=FALSE)
gives a p-value of 31.7%. It is not possible to discard the hypothesis;
therefore, it cannot be stated that the drug is effective.

7.3. One uses Eq. (7.32), where $Np_i = 100 \, m^{x_i} \exp[-m]/x_i!$ and $m = 4.58$ is the
value of the sample mean from the data. It is necessary to group the first two
and the last three channels, to have $Np_i > 5$. One finds $\chi^2 = 2.925$ with
6 degrees of freedom, corresponding to $\chi^2_R(6) = 0.49$. From Table E.3, one
finds $P\{Q_R > 0.49\} \simeq 0.82$, giving a p-value of $2(1 - 0.82) \simeq 0.36$. It results
that, on average, 4.6 buses arrive in 5 min and that the data are compatible
with the Poisson distribution, because we would have a high probability to be
wrong if we discard this hypothesis when true.

7.4. $\chi^2(1) = (356 - 375)^2/375 + (144 - 125)^2/125 = 3.85$. $P\{Q > 3.85\} < 0.05$
from Table E.3. The model is rejected.

7.5. (a) The non-parametric method of contingency tables gives, for 5 degrees of
freedom, $\chi^2_R = \chi^2(5)/5 = 0.84/5 \simeq 0.2$, $P\{Q_R \ge \chi^2_R\} \simeq 0.96$, from
Table E.3. The compatibility test is passed, but the data should be viewed with
suspicion, because the $\chi^2$ value is too small. (b) Applying Eq. (7.32) with
$Np_i = 100$ for all the 12 bins, one obtains $\chi^2 = 2.52$. Each die contributes
with 5 degrees of freedom, so that $\chi^2_R(10) = 2.52/10 = 0.252$,
$P\{Q_R \ge \chi^2_R\} \simeq 0.99$. If we say that the dice or rolls are rigged, we have a chance
of being wrong $< 1\%$. In fact, with 100 events expected per channel, the
$1\sigma$ statistical fluctuations are $\simeq 100 \pm \sqrt{100} = 100 \pm 10$, and the 12
observed bins all fluctuate only within $1\sigma$. It is therefore reasonable to discard
the hypothesis, because the “data fluctuate too little”.

7.6. The difference test gives $t = (60 - 33)/\sqrt{60 + 33} = 2.8$, from which, under
the Gaussian approximation, $P\{T > 2.8\} = 2.6 \cdot 10^{-3}$. This is the probability
to be wrong in rejecting the limit ineffectiveness hypothesis if it were true.

7.7. The $\chi^2$ is $(29 - 19)^2/19 + (18 - 19)^2/19 + \dots = 11.7$ with 5 degrees of
freedom, corresponding to a $< 5\%$ level. The test fails.

7.8. From the difference test, one obtains
$(10500 - 2400 - 6700)/\sqrt{10500 + 91000} = 4.4$. The two results are incompatible.
Note that, after multiplication by 10,
the variances of the first two measures are not $10 \cdot 240$ and $10 \cdot 670$, because
the scaled variables are no longer Poisson. The variance of the sum has been
instead calculated as $[10 \cdot \sqrt{240}]^2 + [10 \cdot \sqrt{670}]^2 = 91000$.
7.9. The $\chi^2$ test gives the result

$$
\chi^2 = (22 - 21.2)^2/21.2 + (12 - 11.6)^2/11.6 + (7 - 6.37)^2/6.37 + (6 - 3.49)^2/3.49 = 1.91
$$

with 3 degrees of freedom. Since 1-pchisq(1.91,3) gives a p-value = 0.59,
there is agreement between data and model.
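
The arithmetic of Problem 7.9 can be reproduced in a few lines. The observed and expected counts are those appearing in the sum above; the vector names are chosen here for illustration.

```r
observed <- c(22, 12, 7, 6)
expected <- c(21.2, 11.6, 6.37, 3.49)

chi2 <- sum((observed - expected)^2 / expected)
p_value <- 1 - pchisq(chi2, df = 3)   # 3 degrees of freedom, as stated above
c(chi2 = chi2, p_value = p_value)
```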

7.10. Integrating the exponential law over the experimental time intervals, the
following values are obtained: 393, 239, 233, 135. The $\chi^2$ test gives:

$$
\chi^2 = \frac{(368 - 393)^2}{393} + \frac{(266 - 239)^2}{239} + \dots = 7.69.
$$

We have 4 degrees of freedom, corresponding to a significance level SL $\simeq 13\%$.
The model is accepted.

7.11. Since $\chi^2 = 2.5$ with 3 degrees of freedom, SL $\simeq 40\%$, and the measures are
compatible.

7.12. The standard variable is $(58 - 55)/(10/\sqrt{10}) \simeq 1$. In order not to pass the
test, a value of at least 1.65 is required, so the obtained value does not exceed
the limits at the required level.

7.13. The $\chi^2$ value is $(13 - 16)^2/16 + (25 - 34)^2/34 + (44 - 34)^2/34 + (16 - 16)^2/16 = 5.88$,
corresponding to SL $\simeq 20\%$. The hypothesis is accepted.

7.14. The $\chi^2$ value is:

$$
(18 - 15)^2/15 + (14 - 15)^2/15 + \dots = 6.7,
$$

corresponding to a p-value, given by 1-pchisq(6.7,6), of about 35%.
The results agree with each other.

7.15. The 15 values are loaded into a vector pv. Then the routine is called, for
example, as MultiTest(pv,alpha=0.05,method='bonferroni').
The problem is solved by changing the alpha values and invoking also the
'fdr' method. The data are already sorted in ascending order. The number
of discarded hypotheses due to significantly different parameters after the
drug intake is listed in the table:

Method          α       False hypotheses
'bonferroni'    0.05    3
                0.01    2
'fdr'           0.05    4
                0.01    3

It can be deduced that the first three parameters are certainly meaningful, and most
likely also parameter 4, which is spotted by 'fdr', with a maximum probability
of false discovery of 5%, given by the value of α.

7.16. The data are present inside the vector breaks. It is necessary to calculate
means and standard deviations of the three samples with the commands
m1<-mean(breaks[1:9]) and s1<-sd(breaks[1:9]) for sample
one, and so on for the others, with their values contained in [10:18]
and [19:27]. From the values m1,m2,m3,s1,s2,s3 thus obtained and
given the dimensions n1=n2=n3=9, we can perform the tests of Sect. 7.3
using the routine TdiffMean.

With the call TdiffMean(m=c(m1,m3),s=c(s1,s3),n=c(9,9),var=TRUE),
a p-value $\simeq 2\%$ is obtained for the difference between samples
with high and low tension (m1, m3) and with medium and high tension
(m2, m3). These results change only slightly with var=FALSE. A similar
result is obtained ($p \simeq 1.6\%$) also with the pooled variance of Table 7.6,
$s = \sqrt{70.03/9} = 2.79$, and using the approximate formula (7.6) with
s = s1 = s2 and x1 = m1 and x2 = m3. Notice that Tukey's test gives
p-values between 4 and 6% (Table 7.8). The value is higher because Tukey's
test carries out three cross-tests and takes into account the test multiplicity.
Consistent results are obtained if Bonferroni correction is applied to the
values obtained with the t-test.

7.17. From Table 7.6 one sees that the pooled variance is 70.03. Tukey's quantiles (7.66)
are then:

q1 = 9.444/sqrt(70.03/9) = 3.385,
q2 = 10.0/sqrt(70.03/9) = 3.585,
q3 = 0.556/sqrt(70.03/9) = 0.199.

With the command 1-ptukey(q1,nmeans=3,df=24), the first p-value
of the table is obtained; replacing q1 with the other values q2, q3, the
calculation is completed. (Notice that the parameter nmeans of ptukey
in the online R handbook is erroneously defined as the group dimension,
whereas it is the number of groups.)

7.18. The days are loaded into a vector days of 25 components and the balm treatments
into a vector balm with the command balm<-rep(c('A','B','P'),c(8,8,9)).
The table is created with test<-data.frame(days,balm),
and the ANOVA test is performed with fit<-aov(days~balm,data=test)
and summary(fit). A p-value = 0.0096 is
obtained, indicating the effectiveness of at least one treatment. Tukey's
test TukeyHSD(fit) indicates the effectiveness of A and an inconclusive
value for B.




Chapter 8¹


8.1. No.

8.2. The probability value is 2/3.

8.3. It is necessary to generate two uniform variables X = 60 * rndm, Y = 60 * rndm
and to count the number of times $-10 < X - Y < 12$. The result
must be statistically compatible with 1/3.

8.4. The exact result, which can be obtained with non-trivial geometric considerations,
is $1 - 3\sqrt{3}/(4\pi) = 0.586503$. Using the method of Problem 8.9,
we obtained a fraction of pairs 3 614 289/6 164 195, corresponding to a
probability of $0.58634 \pm 0.00020$.

8.5. The highest generation efficiency ($\simeq 78\%$) is obtained for the minimum $k$
value satisfying the relation $k e^{-x} \ge p(x) \ \forall x \ge 0$; this condition is verified
for $k \ge \sqrt{2e/\pi}$. To obtain a variable $Z \sim N(0, 1)$, it is necessary to generate
a new number $\xi$ and set $z = -x$ if $0 \le \xi \le 0.5$; $z = x$ if $0.5 < \xi \le 1$.
Alternatively, the method described in [Mor84] can also be employed.

8.6. After a loop of $n = 5$ attempts is executed, the frequency $f = x/n$
is calculated, with $x$ equal to the number of events satisfying the additional
condition $0 \le \text{rndm} \le 0.25$. The counters $t_1$, $t_2$, $t_3$ are incremented by one
unit when $p$ is inside the interval (6.31), calculated with the obtained value of
$f$ and with $t = 1, 2, 3$, respectively. This cycle must be repeated for a large
number $N$ of times, and the final values of $t_1/N$, $t_2/N$ and $t_3/N$ provide the
requested levels. With $N = 10000$, we obtained $t_1 = 6570$, $t_2 = 9852$, $t_3 = 9994$.
Calculate the resulting frequencies with their error and compare them
with the probability levels of the $3\sigma$ law.

8.7. With uniform variables, $\mu = 0.5$, and the uncertainty of the sample mean is
$\sigma = 1/\sqrt{12N}$. If $N = 12$, the standard variable value $t = \sum_{i=1}^{12} \xi_i - 6$ is
obtained. This algorithm is implemented in the routine MCgauss1, present
in our website.

8.8. $Y = \rho X + \sqrt{1 - \rho^2} \, Y_R$, where $X$ and $Y_R$ are standard variables from the
gauss2(0,1) routine.

8.9. The variables $-R \le X \le R$ and $-R \le Y \le R$ are uniformly and randomly
generated and are accepted as coordinates when $\sqrt{X^2 + Y^2} \le R$.

8.10. The histogram comes from a population of density $p_Z(z) = \frac{1}{\pi(1 + z^2)}$, known
as Cauchy's distribution, which has no mean and infinite variance.

8.11. We need to generate two Gaussian variables of given mean, standard deviation
and correlation: $X = 8 \, g_1 + 175$, $Y = (\rho \, g_1 + \sqrt{1 - \rho^2} \, g_2) \cdot 6 + 165$, where
$g_1$ and $g_2$ are standard normal variates from the gauss or gauss2 routines.
Then, the percentage of the generated pairs having $X > 180$ and $Y > 170$ is
evaluated. With 10 million pairs, we obtained $(10.71 \pm 0.01)\%$.
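
The generation scheme of Problem 8.11 can be sketched with `rnorm` in place of the gauss routines. The correlation coefficient is left as a free parameter here, because its value is not quoted in the solution; the function name `frac_tall`, the seed and the sample size are likewise choices made only for this example. For $\rho = 0$ the result can be compared with the product of the two marginal tail probabilities.

```r
# Fraction of correlated Gaussian pairs with X > 180 and Y > 170.
# rho is a free parameter: its value is not given in the solution text.
frac_tall <- function(rho, n = 1e6) {
  g1 <- rnorm(n)
  g2 <- rnorm(n)
  x <- 8 * g1 + 175
  y <- (rho * g1 + sqrt(1 - rho^2) * g2) * 6 + 165
  mean(x > 180 & y > 170)
}

set.seed(1)                  # arbitrary seed, for reproducibility only
f0 <- frac_tall(rho = 0)     # uncorrelated case
exact0 <- (1 - pnorm(180, 175, 8)) * (1 - pnorm(170, 165, 6))
c(simulated = f0, exact_independent = exact0)
```

Calling `frac_tall` with the correlation coefficient of the original problem should give a value statistically compatible with the percentage quoted above.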


¹ The values obtained by the reader with simulation codes must be statistically compatible with
the solutions reported here. Compatibility must be verified with the statistical tests described in
particular in Sects. 6.12, 7.1, and 7.2.




8.12. 
8.13. 


8.14. 


8.15. 


8.16. 
8.17. 


8.18. 


8.19. 


The results must be compatible with the solution of Problem (5.9). 
Two uniform variates x = ξ1 and θ = 2πξ2 are generated. The number n of
successes x + cos(θ) < 0 or x + cos(θ) > 1 over a total of N attempts is
calculated. The estimate of π is 2N/n ± (2N/n²)√(n(1 − n/N)). See also our
Buffon routine.
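
The needle experiment just described can be sketched as below. This is an illustration only, not the Buffon routine itself; buffon_pi is a hypothetical name and the needle length is taken equal to the line spacing, as the success condition implies.

```r
# Buffon's needle with needle length equal to the line spacing (both 1).
# buffon_pi is a hypothetical helper, not the book's Buffon routine.
buffon_pi <- function(N) {
  x     <- runif(N)                      # position of one needle end
  theta <- 2 * pi * runif(N)             # needle direction
  tip   <- x + cos(theta)                # projected position of the other end
  n     <- sum(tip < 0 | tip > 1)        # the needle crosses a line
  c(estimate = 2 * N / n,
    error    = (2 * N / n^2) * sqrt(n * (1 - n / N)))
}

set.seed(4)
res <- buffon_pi(200000)
res
```
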
A possible solution is our routine MCasimm. Remember that the uncertainty
on the standard deviation of N events is ≈ σ/√(2N) (see Sect. 6).
After excluding the extreme values p < 0.01 and p > 0.99, the condition is 
reached for n & 200. 
The condition is reached for w = 70. 
P{r* > 0} ~ 6% (r = sample correlation coefficient). The model must be 
rejected even if this value is only slightly above the limit. 
The routine BootOdds solves the problem by generating 1000 values f1 and
f2 using rbinom(1,size=N1,prob=f1), rbinom(1,size=N2,prob=f2),
from which a sample oddsboot of values f1/f2 is determined.
With the command
quantile(oddsboot,c(0.05,0.95),names=FALSE),
the values [0.444, 0.668] are obtained. Compare this result with solution D.
Tukey’s test considers the means of m Gaussian samples of size n and
uses the quantiles of the statistic q = (ȳ_max − ȳ_min)/√(MS_E/n), where
MS_E = Σ_ij (y_ij − ȳ_j)²/(mn − m) is the pooled variance of the sample set.
The solution is our routine MCTukey, which evaluates the quantiles in the
following way:
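# Illustrative settings, assumed here (MCTukey receives them as inputs)
Nsim <- 10000; groups <- 4; nmeans <- 10; alpha <- 0.05
dof <- groups * (nmeans - 1)                 # degrees of freedom, mn - m
meang <- numeric(groups); devg <- numeric(groups); delta <- numeric(Nsim)
for (j in 1:Nsim) {
  for (k in 1:groups) {
    vg <- rnorm(nmeans)                      # sample of nmeans data
    meang[k] <- mean(vg)
    devg[k] <- var(vg) * (nmeans - 1)        # deviance
  }
  pool <- sum(devg) / dof                    # pooled sample variance
  delta[j] <- (max(meang) - min(meang)) / sqrt(pool / nmeans)  # Tukey statistic
}
qtl <- 1 - alpha
qtuk <- quantile(delta, probs = qtl, names = FALSE)  # Tukey quantile
# The result can be compared with the R function qtukey(qtl, groups, dof)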


Chapter 9 


9.1. 


A possible solution is our code MCdelta, which generates n = 100 Gaussian
variates and calculates the difference Δ = x_max − x_min between the maximum
and minimum values. The procedure is repeated a very large number N of
times to obtain, at the end of the loop, the histogram of the variable Δ. An
asymmetric histogram is obtained, coming from a population of unknown
analytic form. With N = 50,000, we obtained, for n = 100, a sample with
mean and standard deviation m = 2.508 ± 0.001, s = 0.302 ± 0.001. From
the graphic study of the histogram, we then determined the quantile value
Δ_0.99 = 3.32. The quality control must discard the batch when Δ > 3.32.
Note that the method is insensitive to the shift of the mean ⟨X⟩.
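
A short sketch of this kind of simulation is given below. It is not the MCdelta code: range_sim is a hypothetical name, and the standard deviation 0.5 of the generated variates is an assumption chosen because it reproduces the quoted mean of Δ (the value set in the problem is not restated here).

```r
# Range of n = 100 Gaussian variates, repeated N times.
# sd = 0.5 is an assumption, chosen to match the quoted mean of the range.
range_sim <- function(N, n = 100, sd = 0.5) {
  replicate(N, {
    x <- rnorm(n, mean = 0, sd = sd)
    max(x) - min(x)
  })
}

set.seed(5)
delta <- range_sim(20000)
c(m = mean(delta), s = sd(delta),
  q99 = quantile(delta, 0.99, names = FALSE))
```

Changing the mean argument of rnorm leaves the histogram of Δ unchanged, which is the insensitivity to a shift of the mean noted above.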




9.2. 


9.3. 


9.4. 


9.5. 


9.6. 


9.7. 


9.8. 




The 5 times are randomly generated from the exponential law, and m1 =
min(t1, t3, t5), m2 = min(t2, t3, t4), m3 = min(t1, t4) and m4 = min(t2, t5)
are determined. The machine stops after a time t = max(m1, m2, m3, m4).
By repeating the cycle 10,000 times, we obtained an asymmetric histogram
with parameters m = 2.52 ± 0.02 and s = 1.70 ± 0.01 days. About 73% of
operating times from the histogram are included in the m ± s interval.

It is necessary to modify the function Funint inside the routine MCinteg
and to set Mode3=0 to exclude importance sampling. The upper limit
x is the variable i2 which has to be modified to the appropriate value.
The solution is the routine MCinteg1 of our website. If stratified
sampling is applied, our routine MCintopt can be used after the definition:
funint <- function(x) exp(-x*x/2)/sqrt(2*pi), and with the call
MCintopt(funint, lower=0, upper=2), when, for example, the
integral between 0 and 2 is evaluated. It is also instructive to perform a very
accurate standard numerical integration with the R routine integrate and
compare its results with the MC ones.
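
The suggested comparison can be sketched as follows. The crude Monte Carlo helper mc_crude is a hypothetical stand-in written for this example (it is not MCinteg or MCintopt); integrate and pnorm are standard R functions.

```r
# Crude Monte Carlo estimate of the Gaussian integral between 0 and 2,
# compared with integrate(). mc_crude is a hypothetical helper.
funint <- function(x) exp(-x * x / 2) / sqrt(2 * pi)

mc_crude <- function(f, lower, upper, N) {
  fx <- f(runif(N, lower, upper))        # integrand at uniform random points
  w  <- upper - lower
  c(value = w * mean(fx), error = w * sd(fx) / sqrt(N))
}

set.seed(6)
mc  <- mc_crude(funint, 0, 2, 100000)
num <- integrate(funint, lower = 0, upper = 2)$value
c(mc, numerical = num, exact = pnorm(2) - 0.5)
```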

The exact value of I can be computed analytically and is I = 2 log 2 − 1 =
0.38629.... For N = 1000, the crude method gives σ ≈ 6.25·10⁻³; the hit
or miss method σ ≈ 1.1·10⁻²; the importance sampling (choosing g(x) = x)
σ ≈ 1·10⁻³; and the stratified sampling (with k = 20 layers) σ ≈ 0.3·10⁻³.
The result must be statistically compatible with the value I = 8. With
the crude method and N = 1000, we obtained I = 8.01 ± 0.13. After
enclosing the integrand function into the region defined by the conditions
−1 < x1, x2, x3 < 1, 0 < y < 3, we have obtained I = 7.89 ± 0.36 with the
hit or miss method (again with N = 1000).

This equation represents an ellipse centred at the origin and with semi-major
and semi-minor axes x and y of length √2 and 1, respectively. The rectangle
surrounding the ellipse has an area A = 4√2. The resolution code randomly
extracts a point within the rectangle and accepts it if x²/2 + y² < 1. The
ratio between the accepted points N_s and the generated ones N gives the
ellipse area with an error given by Eqs. (9.42) and (9.43). With the routine
MCellipse and 100,000 points, we obtained an area of 4.433 ± 0.007.
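
A hit-or-miss sketch of this calculation is given below. It is not the MCellipse routine; ellipse_area is a hypothetical name, and the error is taken as the binomial uncertainty of the accepted fraction, which we assume is what the quoted equations express. The exact area, π√2 ≈ 4.443, is available for comparison.

```r
# Hit-or-miss estimate of the area of the ellipse x^2/2 + y^2 < 1.
# ellipse_area is a hypothetical helper, not the book's MCellipse.
ellipse_area <- function(N) {
  x <- runif(N, -sqrt(2), sqrt(2))
  y <- runif(N, -1, 1)
  A <- 4 * sqrt(2)                       # area of the enclosing rectangle
  p <- mean(x^2 / 2 + y^2 < 1)           # fraction of accepted points
  c(area = A * p, error = A * sqrt(p * (1 - p) / N))
}

set.seed(7)
ell <- ellipse_area(100000)
ell
```
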
The tests can be performed with our routine MCmetrop. Generating 10,000
variates and using the last 5000, we obtained m = 0.029 ± 0.018, s = 0.875 ±
0.010 with a = 2, and m = 0.017 ± 0.024, s = 1.002 ± 0.016 with a =
3. One notes that with a = 3 the error increases, but the biased estimate
of σ is corrected. The results do not substantially change with a > 3, but
longer sequences are needed to stabilize the result. The value a = 3.2 seems
optimal. With 5000 standard variates from the routine runif, we obtained
m = 0.023 ± 0.014, s = 0.994 ± 0.010, which is a more accurate result than
that evaluated using the Metropolis algorithm with a = 3.

Referring to the figure, one has to write a code generating the emission
point with the formulae x = −2 + 4ξ1, y = −3 + 6ξ2 and the flight
direction as cos θ = 1 − 2ξ3, φ = 2πξ4. One then has a = h tan θ, r =
√(x² + y²), R = √(r² + a² − 2ra cos φ). If R is less than the radius of the
window R_d, the particle is counted by the detector. If n is the number of
counted particles over a total of N emitted particles, the efficiency is given
by ε = n/N ± √(n(1 − n/N))/N. Using the given input data, our routine
MCdetector gives as result ε = 0.059 ± 0.002.


9.9.


9.10.

A possible solution is given by our routine MCvmises, which uniformly
samples within −π ≤ x ≤ π, with the value of c given as an input parameter.
The solution depends on the subjective evaluation of the evolution graph. The
reader should be able to verify that, for n too high, the algorithm is not reliable
in predicting the expected number of atoms with spin = 1.


Chapter 10 


10.1. 


10.2. 


10.3. 


10.4. 
10.5. 


The probability to extract a black marble is 1/3 or 2/3. From the binomial 
density, the following table is obtained: 


                     x=0     x=1     x=2     x=3     x=4
b(x; 4, p = 1/3)    16/81   32/81   24/81    8/81    1/81
b(x; 4, p = 2/3)     1/81    8/81   24/81   32/81   16/81


The ML estimate of p is then p̂ = 1/3 if 0 ≤ x ≤ 1, p̂ = 2/3 if 3 ≤ x ≤ 4.
If x = 2, the likelihood function (binomial p.d.f.) takes the same value for
both values of p and therefore has no unique maximum, so that
the ML estimator is undefined in this case.
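
The table and the resulting ML choice can be reproduced with the standard R function dbinom. The sketch below is only an illustration; the object names tab and phat are ours.

```r
# Binomial probabilities b(x; 4, p), in units of 1/81, for the two values of p.
x   <- 0:4
tab <- rbind(p13 = dbinom(x, 4, 1/3), p23 = dbinom(x, 4, 2/3)) * 81
colnames(tab) <- paste0("x=", x)
round(tab)

# ML choice for each x: the p with the larger likelihood (NA marks the tie at x = 2)
phat <- ifelse(tab["p13", ] > tab["p23", ] + 1e-9, 1/3,
        ifelse(tab["p23", ] > tab["p13", ] + 1e-9, 2/3, NA))
phat
```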


      x=0   x=1   x=2   x=3
p     0.1   0.3   0.7   0.9


From Theorem 10.2, Xy is such that a p(x; ,¢)dx =a. 

dℒ/dλ = n/λ − Σ_i t_i = 0, from which 1/λ̂ = Σ_i t_i/n = m.

(a) From Eq. (1.33), we have P(x; N) = A(x)[(N − n)!]²/[N!(N − 2n + x)!],
where A(x) contains factors independent of N. Since P(x; N) ≡ L(N),
the study of the function for discrete values of N shows that L(N + 1) ≥
L(N) until N ≤ n²/x − 1. The maximum of L(N) occurs for N =
1 + int(n²/x − 1), where int is the integer part of the argument. From the
data, N = 609 is obtained. (b) Using Stirling approximation and Eq. G2,
one obtains dℒ/dN = −d ln L(N)/dN = ln[N(N − 2n + x)/(N − n)²],
from which N̂ = n²/x = 608. Since N is a discrete variable, the Cramér-Rao
bound cannot be used. Moreover, Eq. (10.29) is difficult to apply, even
using the Stirling approximation and assuming N as a continuous variable.
The process can then be simulated by generating a series of x values with
N = 600 and analysing the histogram of the estimated N̂. Repeating
the experiment 5000 times, we obtain an asymmetric histogram (with
a long right tail) of parameters m = 618 ± 1, s = 80.0 ± 0.8. The
maximum value is at N̂ = 600. A fraction of 72% of the values is inside
the interval 608 ± 80. Hence, a reasonable estimate is N ∈ 608 ± 80,
CL = 72%. Since x follows the binomial density, under the Gaussian
approximation (x > 10, n − x > 10) Eq. (5.57) can be used, giving the result
Var[N̂] = Var[n²/x] ≈ (n⁴/x³)[(1 − x/n) + (2/x)(1 − x/n)²] (note that
the second term is negligible). An interval N ∈ 608 ± 88 is thus obtained,
in good agreement with simulation.


10.6.


10.7.


10.8.


10.9.


10.10.


10.11.

The likelihood function of two trials is L = p^(x1+x2) (1 − p)^(2−(x1+x2)) =
L(p; S). S is a sufficient statistic; P is not. In fact, p can only be estimated
from the sum of the successes, and not from the product.


Neglecting constant factors, ln L = −(n/2)[ln σ² + s²/σ²], where s² =
w/n (the mean is known). Notice that ⟨s²⟩ = σ² and that w is a
sufficient statistic. By computing the first and second derivatives of ln L,
we find that the expected information is nI(θ) = n/(2σ⁴). The same
result is obtained by applying the suggested methods (a) and (b), from
|(∂ ln L/∂σ²)/√(nI)| ≤ |t_α| and |(s² − σ²)/√((nI)⁻¹)| ≤ |t_α|, where t_α is
the selected quantile. The result is also identical to that of Eq. (6.70).

If w = Σ_i x_i² and n = 6, the logarithm of the likelihood obtained from
the Gaussian with zero mean is ln L(σ²; w) = −(n/2) ln σ² − w/(2σ²).
The maximum is at σ̂² = w/n = 39.0/6 = 6.5, which is the ML
estimate of σ². To apply Eq. (10.45), it is necessary to redefine L, so that
Δ[ln L(σ̂²; x)] = 0. One then obtains Δ[ln L(σ²; w)] = (n/2)[ln σ² +
w/(nσ²)] − (n/2)[ln(w/n) + 1]. The value CL = 95.4% is obtained when
2Δ[ln L(σ²; x)] = 4. The numerical study of the function gives the interval
σ² ∈ [2.5, 27.1], that is, 6.5 with asymmetric errors −4.0 and +20.6.
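
The numerical study of the likelihood can be sketched with the standard R function uniroot. The names lnL, g, lo and hi are ours, and the bracketing intervals passed to uniroot are assumptions chosen to enclose one root on each side of the maximum.

```r
# Likelihood interval for sigma^2: points where 2 * (drop in ln L) equals 4.
w <- 39.0; n <- 6
lnL   <- function(s2) -(n / 2) * log(s2) - w / (2 * s2)
s2hat <- w / n                                  # ML estimate, 6.5
g     <- function(s2) 2 * (lnL(s2hat) - lnL(s2)) - 4

lo <- uniroot(g, c(0.5, s2hat), tol = 1e-8)$root   # lower limit
hi <- uniroot(g, c(s2hat, 200), tol = 1e-8)$root   # upper limit
round(c(lo, hi), 1)
```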

The χ² value (see Eq. (6.76)) is calculated with 6 degrees of freedom. From
Table E.3 the following values of the reduced χ² with 6 degrees of freedom
are obtained by interpolation: 2.46 and 0.20. The confidence interval with
CL = 95.4% is σ² ∈ [2.6, 32.6]. This estimate is better than the previous
one, which holds only asymptotically.

The artificial data of this exercise were generated from a Gaussian with σ² =
9.

The two methods give the same result numerically: σ² ∈ [8.0, 14.0]. Also
these artificial data were generated from a Gaussian with σ² = 9.

The interval estimate of the two means is μ1 ∈ 2.08 ± 0.016 cm, μ2 ∈
2.05 ± 0.011 cm. The two means are compatible, because the difference
test (7.11) provides a standard value t = 1.58. The ML estimate is the
weighted mean: μ̂ = 2.0596. The interval estimate is given by Eq. (10.70):
μ ∈ 2.0596 ± 0.0091.
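
The weighted mean can be checked with a few lines of R. The sketch assumes that Eq. (10.70) is the usual inverse-variance formula, which reproduces the quoted numbers; the variable names are ours. With the rounded inputs used here the standard value of the difference comes out at about 1.5, close to the value quoted above.

```r
# Inverse-variance weighted mean of the two estimates.
m <- c(2.08, 2.05)                 # interval centres (cm)
s <- c(0.016, 0.011)               # their uncertainties (cm)
w <- 1 / s^2                       # weights

mu_hat <- sum(w * m) / sum(w)      # weighted mean
s_mu   <- 1 / sqrt(sum(w))         # its uncertainty
t_diff <- (m[1] - m[2]) / sqrt(sum(s^2))   # standard value of the difference
round(c(mu_hat = mu_hat, s_mu = s_mu, t_diff = t_diff), 4)
```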

If the random variable X is the number of negative samples over a total
a priori fixed number of N examined samples, the likelihood function is
binomial: L(μ; x) = N!/[x!(N − x)!] [exp(−μ)]^x [1 − exp(−μ)]^(N−x), where
exp(−μ) is the Poissonian probability of observing zero events when the
mean is μ. The ML estimate of μ gives exp(−μ̂) = x/N, from which
μ̂ = ln(N/x) = ln(50/5) = 2.3. Compare this result with Table 6.2 and
Eqs. (6.41).


10.12.


10.13.


10.14.


10.15.


10.16.

It is necessary to group the last three bins to have n(t) > 5: the result is
ten events (times) between 14 and 20 s. The expected probability values
within each bin Δt = (t1, t2) are given by p_i(θ) ≡ p_i(λ) = ∫_{t1}^{t2} e(t) dt =
exp(−λt1) − exp(−λt2). By using Eq. (10.58), and a non-linear fit program,
one obtains λ ∈ 0.337 ± 0.011, χ²_R = 9.55/7 = 1.36. The χ² must be
divided by 7 degrees of freedom, since a parameter estimated from the data
(λ) was used. The observed SL (p-value) is ≈ 24%, by interpolation from
Table E.3. From Eq. (10.60) one then obtains λ ∈ 0.339 ± 0.011, χ²_R =
10.32/7 = 1.47, SL ≈ 19%. The data were artificially generated from an
exponential density with λ = 1/3.

Test level: α = P{X = 1; H0} = ε. Power of the test: 1 − β = 1 − P{X =
0; H1} = P{X = 1; H1} = 1 − ε.

Test level: α = P{X1 = 1, X2 = 1; H0} + (1/2) P{X1 + X2 = 1; H0} = ε.
Power of the test: 1 − β = 1 − P{X1 = 0, X2 = 0; H1} − (1/2) P{X1 + X2 =
1; H1} = 1 − ε. The result is the same as in the single trial. However, if
possible, it is advisable to carry out both tests, because in extreme cases the
observed significance level and its power assume more favourable values.
For example: α = P{X1 = 1, X2 = 1; H0} = ε², 1 − β = 1 − P{X1 =
0, X2 = 0; H1} = 1 − ε².

Since p = (0.050 − 0.029)/(0.057 − 0.029) = 0.750, for S = 16, a number
0 < random < 1 is generated, and the hypothesis is accepted if 0.75 <
random < 1. In practice, one lot out of four is accepted.

The requested test levels are α = 0.01 and β = 0.05. (a) If λ0 = 1/100,
λ1 = 1/110 and t_n = (Σ_i t_i)/n is the mean of the observed times, from
the equations (t_n − 100)/(100/√n) = 2.326 and (t_n − 110)/(110/√n) =
−1.645, the values n = 1710 and t_n = 105.6 are obtained. It is necessary
to allocate a sample of 1710 suspensions, to calculate the average of the
downtimes and to accept the supplier’s declaration if the average exceeds
105.6 h.
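
The two equations of point (a) can be solved in closed form by eliminating t_n. The short sketch below does this in R; the variable names are ours.

```r
# Solve (tn - mu0)/(mu0/sqrt(n)) = z_a and (tn - mu1)/(mu1/sqrt(n)) = -z_b.
z_a <- 2.326; z_b <- 1.645         # Gaussian quantiles for alpha = 0.01, beta = 0.05
mu0 <- 100;   mu1 <- 110           # mean times 1/lambda0 and 1/lambda1 (hours)

# Equating tn = mu0 * (1 + z_a/sqrt(n)) = mu1 * (1 - z_b/sqrt(n)):
rootn <- (mu0 * z_a + mu1 * z_b) / (mu1 - mu0)
n     <- rootn^2                   # required sample size
tn    <- mu0 * (1 + z_a / rootn)   # critical value of the sample mean
c(n = n, tn = tn)
```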

(b) From the ratio: L0/L1 = [λ0 exp(−λ0 t_n)]ⁿ/[λ1 exp(−λ1 t_n)]ⁿ, and using
logarithms, one can write the indecision interval (10.110) as −4.55/n −
ln(λ0/λ1) < (λ1 − λ0) t_n < 2.98/n − ln(λ0/λ1). The hypothesis H1 (H0)
is chosen when the condition is unsatisfied to the left (right) tail. Simulation
shows that, if the suspensions are really better, a sample of
n ∈ 934 ± 6 suspensions with a sample mean of t_n ∈ 112.7 ± 0.1 hours
is enough, on average, to confirm the manufacturer’s claim. If the quality of
the new suspensions is the same as the old ones, on average, a sample of n = 670
suspensions and an average of t_n ∈ 97.1 ± 0.1 hours is enough to take the
right decision. Warning: these are average values, and it can happen, with
a single test, to exceed the value n = 1710 previously obtained with the
sample of fixed size and the normal approximation. The region of indecision
is included, in the plane (n, t_n), between two hyperbolas. We suggest that you
carefully examine the simulated histograms!




10.17. 


10.18. 


10.19. 


10.20. 




It would seem convenient to start with the sequential test and use method 
(a) when the number of pieces exceeds 1710. However, this procedure is 
incorrect, because deciding which type of test to use after having obtained 
some (full or partial) results alters the a priori levels of probability. The type 
of test (a) or (b) must therefore be decided before performing the test. 

The variable Q_n = 2λ1 n T_n, where T_n is the mean value of n exponential
times, follows the χ²(2n) distribution. One then has 1 − β(λ1) = 1 −
P{Q ≤ q_λ1}, where Q has χ²(2n) distribution. Since n is large, the normal
approximation can be used, and the Gaussian test on the mean is valid:
1 − β(λ1) = 1 − Φ[(q_λ1 − 2n)/(2√n)] = 1 − Φ[(t_n − 1/λ1)/(1/(λ1√n))].
Look at Fig. 10.6, and assume that the hypotheses H0: p0 = 0.5 and
H1: p1 = 0.3 are associated to Gaussian densities. Since Table E.1 gives
t = −1.28 and t = +1.28 for the areas 0.5 − α = 0.4 and 0.5 − β = 0.4,
and we have a one-tailed test (t_{α/2} → t_α), it is enough to
use Eq. (10.83) with |t_α| = |t_β| = 1.28, p0 = 0.5 and p1 = 0.3. One thus
obtains n = 38. The critical value is x = 38(0.3 + 1.28√(0.3·0.7/38)) ≈ 15.
A sample of 38 elements is needed; if x < 15, H1 is accepted; otherwise,
H0 is kept.

The ratio (10.99) is R = L0/L1 = (0.5^x · 0.5^(n−x))/(0.3^x · 0.7^(n−x)). From
Eqs. (10.104) and (10.106), since (1 − α)/β = 9 and α/(1 − β) = 1/9, one
obtains, passing to logarithms, the following result: p = 0.5 is accepted if
x > 0.397n + 2.593, and p = 0.3 is chosen if x < 0.397n − 2.593. The
plane band between these two lines is the uncertainty region.

Repeating the test 10,000 times with the simulation of a fair coin (p = 0.5),
we obtained, for the n values, an exponential histogram with parameters
m ∈ 23.3 ± 0.2 and σ ∈ 17.3 ± 0.1. The wrong hypothesis has been chosen
in 786 cases over 10,000. The average number of attempts (23) is smaller
than the result of Problem 10.18 (38). However, in 1484 cases the simulated
sequential test required a number of attempts > 38.
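
A sequential test of this kind can be sketched as follows. The code first recomputes the two decision lines for p0 = 0.5 and p1 = 0.3 with the thresholds 9 and 1/9, and then simulates a fair coin. The function sprt_once and the number of repetitions (5000) are our own choices for the illustration, so the results are expected to agree with those quoted above only statistically.

```r
# Decision lines of the sequential test and a simulation with a fair coin.
p0 <- 0.5; p1 <- 0.3
A  <- log(9)                                       # ln[(1 - alpha)/beta]
den       <- log(p0 * (1 - p1) / (p1 * (1 - p0)))
slope     <- log((1 - p1) / (1 - p0)) / den        # about 0.397
intercept <- A / den                               # about 2.593

# One sequential experiment: toss until one of the two lines is crossed.
sprt_once <- function(p_true) {
  x <- 0; n <- 0
  repeat {
    n <- n + 1
    x <- x + (runif(1) < p_true)                   # success with probability p_true
    if (x > slope * n + intercept) return(c(n = n, choice = p0))
    if (x < slope * n - intercept) return(c(n = n, choice = p1))
  }
}

set.seed(8)
sim    <- replicate(5000, sprt_once(0.5))
mean_n <- mean(sim["n", ])                         # average number of attempts
wrong  <- mean(sim["choice", ] == p1)              # fraction of wrong decisions
c(slope = slope, intercept = intercept, mean_n = mean_n, wrong = wrong)
```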

One easily finds ⟨X⟩ = (θ + 1)/(θ + 2), Var[X] = (θ + 1)/[(θ + 2)²(θ +
3)] and the cumulative function F_X(x) = x^(θ+1). The Neyman-Pearson ratio
evaluates the best critical region as {−ln r_α + n ln[(1 + θ0)/(1 + θ1)]}/
(θ1 − θ0) ≤ Σ_i ln x_i ≤ 0. If Z = ln X, F_Z(z) = P{Z ≤ z} = P{X ≤ e^z} =
e^((1+θ)z). The p.d.f. of Z is then dF_Z/dz = p_Z(z) = (1 + θ) exp[(1 + θ)z],
and hence ⟨Σ_i ln X_i⟩ = −n/(1 + θ), Var[Σ_i ln X_i] = n/(1 + θ)². Since
n = 100, the normal approximation can be used; for α = 0.05, t_{1−α} =
1.64, and, if θ = θ0 = 1, the null hypothesis rejection occurs for [Σ_i ln x_i −
(−n/2)]/(√n/2) ≥ 1.64, that is for: Σ_i ln x_i ≥ −41.8. This test on Σ_i ln x_i
has the maximum power for α = 0.05. It also results that r_α = 3.49, but
this value is not necessary for the test, if the normal approximation is used.
The power curve is 1 − β = 1 − Φ[(−41.8 + n/(1 + θ1))/(√n/(1 + θ1))],
θ1 > θ0. For n = 100 and θ1 = 2, one has 1 − β = 0.996. It is useful to
verify that the test on the sum Σ_i x_i gives, under the normal approximation,
only a slightly smaller power, 1 − β = 0.989.




Chapter 11 


11.1. 


11.2. 


11.3. 


11.4. 


11.5. 


11.6. 


From Eq. (5.65), under the hypothesis Cov[X, Z] = 0, one obtains σ_y² =
b² Var[X] + Var[Z] + 2b Cov[X, Z] = b² Var[X] + Var[Z]. Defining
ΔX = X − μ_x and ΔY = Y − μ_y, the covariance between these variables can
be computed from Eq. (5.83), as in Exercise 4.3: Cov[X, Y] = ⟨ΔX ΔY⟩ =
b⟨Δ²X⟩ = bσ_x². We then obtain the result ρ = ±σ[f(X)]/σ[Y] =
bσ_x/σ_y = bσ_x²/(σ_x σ_y) = σ_xy/(σ_x σ_y).

From the relation: s_x = 0.10x/√12 (the same for y), and using the effective
variance formula for the s_E calculation, the following table is obtained:


x      10     20     30     40     50     60     70
s_x    0.3    0.6    0.9    1.2    1.4    1.7    2.0
y      21.4   38.8   52.2   88.1   99.5   120.4  158.3
s_y    0.6    1.1    1.5    2.6    2.9    3.5    4.6


The minimization of Eq. (11.16), where f(x; a, b) = a + bx and the effective
variance in the denominator is s_y² + b²s_x², gives the values a ∈ −0.13 ± 1.15,
b ∈ 2.04 ± 0.05. With two iterations of the linear fit, one obtains a ∈ 0.23 ±
1.12, b ∈ 2.02 ± 0.05. The data were generated with a simulation starting
from Y = 1 + 2X. We also get χ²(ν)/ν = 28.0/5 = 5.6. However, the χ²
test is not meaningful, because data are not Gaussian.
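
The iterated linear fit with the effective variance can be sketched as below. The helper wls_line is hypothetical (it is not one of the book's routines), the y values are read from the table with the decimal points restored, which is an assumption, and the starting pass uses the errors on y only.

```r
# Straight-line fit with effective variance s_y^2 + b^2 * s_x^2, iterated twice.
x  <- c(10, 20, 30, 40, 50, 60, 70)
y  <- c(21.4, 38.8, 52.2, 88.1, 99.5, 120.4, 158.3)   # decimals restored (assumed)
sx <- c(0.3, 0.6, 0.9, 1.2, 1.4, 1.7, 2.0)
sy <- c(0.6, 1.1, 1.5, 2.6, 2.9, 3.5, 4.6)

# Weighted least squares for y = a + b x with known variances s2 (hypothetical helper)
wls_line <- function(x, y, s2) {
  X   <- cbind(1, x)
  V   <- solve(t(X) %*% (X / s2))          # covariance matrix of (a, b)
  est <- drop(V %*% t(X) %*% (y / s2))
  list(a = est[[1]], b = est[[2]], se = unname(sqrt(diag(V))))
}

fit <- wls_line(x, y, sy^2)                # starting pass: errors on y only
for (k in 1:2)                             # two passes with the effective variance
  fit <- wls_line(x, y, sy^2 + fit$b^2 * sx^2)
fit
```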

A non-linear fit with the straight line, f(x; a, b) = a + bx, and effective
variance s_y² + b²s_x² gives the values a ∈ −0.64 ± 1.14, b ∈ 2.10 ± 0.05. With a
two-step linear fit, one obtains a ∈ −0.62 ± 1.12, b ∈ 2.10 ± 0.05. The same
result is obtained also with the routine FitLineBoth. The data have been
simulated starting from Y = 1 + 2X. The χ² test is meaningful, because data
are Gaussian. One obtains χ²(ν)/ν = 2.05/5 = 0.41, corresponding, from
Table E.3, to a left-tailed SL ≈ 20%. In this case, the straight line is a good
model.

The straight lines passing through (x0, y0) must satisfy the constraint y0 =
a_j + b_j x0. If a_j and b_j are known, it is convenient to represent them in the
so-called dual plane, where a line is represented by a point of coordinates
(b, a). In this plane, the straight line fit of the points (b_j, a_j) performed with
the function a = (−x0)b + y0 evaluates the vertex coordinates (x0, y0) and
their error. In physics, this method is used to determine the emission vertex
of electrically charged nuclear particles from their reconstructed trajectories
inside specific detectors. In case of curved trajectories (as the ones inside
magnetic fields), one can proceed by successive steps, using straight lines
tangent to the particle trajectories.

The data were obtained from a simulation assuming Y = 5 + 0.8X² + Y_R,
where σ[Y_R] = 0.5.

Both X and Y have σ = 0.5, and a correlation f(X) = 5X + 0.2X² has been
simulated between them.




11.7. 


11.8. 




The error σ[Y] is estimated as s(y_i) = 0.10·y_i. In this way the following
values are obtained:


x       2      4             6      8      10
y       7.5    [illegible]   17.0   25.5   23.8
s(y)    0.8    1.2           1.7    2.5    2.4


These data are loaded into vectors x, y, sy and passed to the routine
FitPolin(x=x,y=y,dy=sy,fitfun=y~x,ww='ABS'), which
determines the estimates â and b̂ of the functional relation y = a + bx. The
code provides the result â ± s(â) = 3.31 ± 1.08, b̂ ± s(b̂) = 2.26 ± 0.24,
r(â, b̂) = −0.845, χ² = 3.66. The LS estimates are compatible with the
true values a = 5 and b = 2, used to generate the artificial simulated
data. The reduced chi-square χ²_R(3) = 1.22 corresponds, from Table E.3,
to a one-tailed significance level of about 30%. The covariance of the estimates,
obtained from the covariance matrix provided by the routine, is s(â, b̂) = −0.22. It is
very instructive to verify the fit result with the simulation method described
in Sect. 8.10. Using the LS estimates â and b̂ and the experimental data
x_i, we generate artificial data y_i with errors s(y_i) according to the algorithm
y_i = (1 + 0.10·ξ_i)(â + b̂ x_i), s(y_i) = 0.10·y_i, where ξ_i is a standard Gaussian
variate. Note that the straight line is calculated with the estimated values and
that the error is always calculated in an approximate way as a percentage
of the observed data, as in the case of a real experiment. Repeating the fit
20,000 times, the histograms of â, b̂, s(â), s(b̂) and of χ² are obtained. The
histogram widths allow you to directly check that the errors s(â) and s(b̂)
from the fit coincide within 3 decimal digits with the standard deviations of
the simulated histograms of â and b̂. The densities of the estimates of a and b
are practically perfect Gaussians, as might be expected. The 20,000 simulated
χ² values are also perfectly distributed as the χ² density with three degrees
of freedom.

At first, it is necessary to evaluate the errors s(y_i) = (Δ/√12) y_i =
2(0.10/3.46) y_i = 0.058 y_i and, analogously, s(x_i) = 0.058 x_i. Also in this
case, the estimate is approximate, because it is obtained as a percentage of
the observed values, instead of the true ones. Loading the values x_i, y_i, s(x_i)
and s(y_i) to vectors x, y, sx, sy and defining the functions
fun<-function(par,x){par[1]+par[2]*x} and
dfun<-function(par,x){par[2]}, where par<-c(0.5,0.5),
the routine FitLineBoth(x,y,sx,sy,par=par,fun=fun,dfun=dfun)
is called, minimizing Eq. (11.16). We obtain â ± s(â) = 5.09 ± 0.55,
b̂ ± s(b̂) = 2.176 ± 0.…, s(â, b̂) = −0.0616, r(â, b̂) = −0.842,
χ² = 3.22. The reduced χ² value is χ²/3 = 1.07, indicating a good
agreement between model and data. The estimates are compatible with the
true values a = 5 and b = 2, which are those used to generate the simulated
data. As in the previous exercise, we now repeat the fit 20,000 times.




Using the experimental data x_i and the LS parameter estimates, we can
generate artificial data x_i′, y_i with errors s(x_i′) and s(y_i) according to the
algorithm y_i = [1 + 0.10(1 − 2ξ)](â + b̂ x_i), x_i′ = x_i + 0.10(1 − 2ξ) x_i,
s(x_i′) = 0.058 x_i′, s(y_i) = 0.058 y_i. At each iteration, the variance s_E² is
computed, and the fit is performed. From this simulation we obtain that the
standard deviations of the histograms of â and b̂ coincide with the values
estimated by the LS method. This agreement increases our confidence in
the least squares algorithms, even with non-Gaussian data. Simulation thus
demonstrates an important fact: the deviations from the Gaussian model are
small. Ultimately, a negligible error is introduced by applying the intervals of
the 3σ law to the regression result.


Chapter 12 


12.1.


12.2.


12.3.


12.4.


12.5.


12.6.


12.7.


12.8.


12.9.


σ[F]/F = √(0.04² + 0.05²) = 0.064, that is, 6.4%.

Passing to logarithms, one has t = −τ ln(I/I0) = 23 days. Propagating the
error on I, one obtains s_t = ±τ s_I/I = ±0.5 days.

Since the process is Poissonian, the experimental count number is an
estimate of variance. From the error propagation law, the background value
is F = 620/10 = 62 counts/s, with σ[F] = √620/10 = 2.5 counts/s. Then
(I − F) ± σ[I − F] ≈ (157 − 62) ± √(157 + 620/100) = 95 ± 13 counts/s,
CL ≈ 68%. The signal to background ratio is “7 sigma”: n_σ = 95/13 ≈
7.3.

The upper limit with CL = 95% is given by the equation 25 + 1.65√μ = μ,
from which μ = 34.7 ≈ 35. The upper limit for the signal is then 35 −
10 = 25 counts/s. Note: it is important to subtract background after the
limit calculation.
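
The equation for the upper limit can be solved numerically with the standard R function uniroot, as in the sketch below. The bracketing interval (25, 100) and the variable names are our own choices; the background value 10 is the one subtracted above.

```r
# Upper limit: solve 25 + 1.65 * sqrt(mu) = mu, then subtract the background.
f <- function(mu) 25 + 1.65 * sqrt(mu) - mu
mu_up <- uniroot(f, c(25, 100), tol = 1e-8)$root   # limit on the total mean
signal_limit <- mu_up - 10                         # background subtracted afterwards
round(c(mu_up = mu_up, signal_limit = signal_limit), 1)
```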

Student’s quantiles with 3 degrees of freedom must be associated to these
four values. From Table E.2 (and for a probability of 0.975), it results that
the statistical error of the mean is 0.04/3.18 = 0.012. The precision is then
0.012·√4 = 0.02.

The percentage error of R propagates on V twice as much as the percentage
error of L. Therefore, it is better to improve the measurement of R.
Var[sin θ] ≈ (cos² θ) s_θ², from which sin θ = 0.50 ± 0.03. Remember to
transform the angle values to radians!

It is immediate to derive that (1 − P²)/N = (4N₊N₋)/N³. If N is a random
Poissonian variable, N± are Poissonian independent random variables,
s²(N±) = N±, and, since (∂P/∂N±) = ±2N∓/N², it follows that the
error propagation procedure gives s²(P) = (4N₊N₋)/N³. If N is fixed
before the measurement, N± are values of correlated binomial variables.
Since N₋ = N − N₊, one has P = 2(N₊/N) − 1, ∂P/∂N₊ = 2/N,
s²(N₊) = N₊(1 − N₊/N) and hence s²(P) = 4(N₊/N²)(1 − N₊/N) =
4N₊N₋/N³. In conclusion, the uncertainty s(P) = √((1 − P²)/N) holds
for any experimental condition.

f(V) = 2.718 ± 0.008, where the error is the standard deviation. The
distribution of f(V) is nearly uniform, as it is easy to verify both analytically
and with a simulation. Then, it is possible to write f(V) = 2.718 ±
0.008·√12/2 = 2.718 ± 0.013, CL = 100%.


12.10.


12.11.


12.12.


12.13.


12.14.


12.15.

Since s²(E2) = (0.05)²(∂E2/∂E1)² + (100/√12)²(∂E2/∂R1)² +
(100/√12)²(∂E2/∂R2)² = (0.103)², one has E2 = 5.00 ± 0.10 V. The
simulation randomly generates uniform variates E1, R1, R2 (e.g. R1 =
1000 + (2 rndm − 1) 50) and provides the histogram of E2. A distribution
close to the triangular p.d.f. is obtained, with a standard deviation s = 0.103,
coincident with the one calculated analytically. The interval m ± s includes
65% of the values. The CL = 68% is achieved for s = 0.110. The direct
measurement of E2 is in agreement with this calculation.

v = 2.00/5.35 = 0.3738 m/s. s_t = 0.05/√20 = 0.011 s; s_l =
0.002/√12 = 5.8·10⁻⁴ m. Since s_v² = v²[s_l²/l² + s_t²/t²] = 6.2·10⁻⁷,
one obtains v = 0.3738 ± 0.0008 m/s. The error is dominated by the time
uncertainty. Since v is the ratio between a uniform and a Gaussian variable,
it is better to verify the CL with a simulation. A Gaussian histogram is
thus obtained, with standard deviation coinciding with the calculated one. A
CL ≈ 68% can then be associated with the result.
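
The propagation and its check by simulation can be sketched as follows. The variable names are ours, and the simulation assumes a length uniformly distributed over a full width of 0.002 m centred on 2.00 m and a Gaussian time with standard deviation s_t.

```r
# Error propagation for v = l / t and a simulation check.
l   <- 2.00;  tm <- 5.35           # length (m) and mean time (s)
s_t <- 0.05 / sqrt(20)             # uncertainty of the mean time
s_l <- 0.002 / sqrt(12)            # uniform resolution of the length
v   <- l / tm
s_v <- v * sqrt((s_l / l)^2 + (s_t / tm)^2)

set.seed(9)
ls <- runif(1e5, l - 0.001, l + 0.001)   # uniform length (assumed centred on l)
ts <- rnorm(1e5, tm, s_t)                # Gaussian time
sd_sim <- sd(ls / ts)
c(v = v, s_v = s_v, sd_sim = sd_sim)
```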

In this case, Var[Y_i] = σ² + ε²x_i² and Cov[Y_i, Y_j] = ε²x_i x_j for i ≠ j. One
then obtains:


      | σ² + ε²x_1²    ε²x_1x_2       ...    ε²x_1x_p     |
V  =  | ε²x_1x_2       σ² + ε²x_2²    ...    ε²x_2x_p     |
      | ε²x_1x_p       ε²x_2x_p       ...    σ² + ε²x_p²  |


We can generate 10,000 standard Gaussian variates and create a histogram
to calculate their mean and variance. The interactive code lines are


x <- rnorm(10000)
h <- hist(x)                       # bin the data once (also draws the histogram)
fx <- h$counts                     # bin contents
xbin <- h$mids                     # bin centres
meanx <- sum(fx * xbin) / sum(fx)
varx <- sum((xbin - meanx)^2 * fx) / sum(fx)


It is better to verify the CL with a simulation. A Gaussian histogram is 
thus obtained, with standard deviation coinciding with the calculated one. A 
CL ~ 68% can then be associated with the result. 

Using the solution of the Problem 5.7, it results that the variable Z = XY
has logarithmic density p_Z(z) = −ln z, with mean and standard deviation
μ ± σ = 0.25 ± 0.22 (σ² = 7/144 ≈ 0.049). The probabilities can be found by
integrating the density in the intervals (μ ± Kσ), where K is a real number. Integrating
between the limits α = max(0, μ − Kσ) and β = min(1, μ + Kσ), one
obtains ∫_α^β (−ln z) dz = [z − z ln z]_α^β = β − β ln(β) − α + α ln(α); hence,
P{|Z − μ| ≤ Kσ} = 0.689, 0.946, 1 for K = 1, 2, 3. Even though the
distribution is logarithmic, the levels are close to those of the 3σ law.
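
The three probability levels can be evaluated directly from the primitive z − z ln z, as in the sketch below. The names Fz and level are ours, and σ is taken as √(1/9 − 1/16), the standard deviation of the product of two independent uniform variates; with this value the K = 1 level comes out near 0.69.

```r
# Probability content of mu +/- K * sigma for Z = X * Y, with X, Y uniform in (0, 1).
mu    <- 1 / 4
sigma <- sqrt(1 / 9 - 1 / 16)                       # about 0.22

Fz <- function(z) ifelse(z > 0, z - z * log(z), 0)  # cumulative function of Z
level <- function(K) {
  a <- max(0, mu - K * sigma)
  b <- min(1, mu + K * sigma)
  Fz(b) - Fz(a)
}
lev <- sapply(1:3, level)
round(lev, 3)
```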

The frequency is obtained by the formula f = c/(ab), where c = 1750,
a = 0.3 m and b = 0.5 m. The corresponding errors are σ_c = √1750 ≈ 42,
σ_a = σ_b = 2·0.005/√12 = 0.00289 m. From the error propagation, one
has:


σ_f² = [1/(ab)]² σ_c² + [c/(a²b)]² σ_a² + [c/(ab²)]² σ_b² = (306)² ,


from which: f = 11667 ± 306 counts/(m² s).


12.16.


12.17.
12.18.


12.19.

If we assume to have the values of x and n_x of Eq. (12.68) loaded into the
vectors x and nx, together with a vector of errors s <- sqrt(nx), the R
instructions are

db<-data.frame(x,nx,s);

class(fitexp <- nx ~ sum(nx) * (mu^x/factorial(x))*exp(-mu));

sv<-list(mu=5.5);

Nlinfit(nlfitfcn=fitexp, database=db, startvalues=sv, weight='ABS').

σ ≈ √(20² − 5²) ≈ 19 bins, from convolution properties.

The fit result is â = 1.86 ± 0.58; b̂ = 0.38 ± 0.10; it is distorted with respect
to the result of Table 12.1.

It is statement (b), because it is falsifiable. 


Appendix E 
Tables 


E.1 Integral of the Gaussian Density 


The standard Gaussian density, described in Sect. 3.5, is given by:


g(t; 0, 1) = (1/√(2π)) exp(−t²/2) .    (E.1)


The values of the integral probability:


E(t) = ∫_0^t g(τ; 0, 1) dτ    (E.2)


to obtain, in a random sampling, a value inside the interval [0, t] are reported in
Table E.1 for t ∈ [0.0, 3.99] in steps of 0.01. The first two digits of t are read in the
first bold column on the left, the third digit in the bold row at the top. The values of
the integral (E.2) are read at the intersection of the rows and columns. By exploiting
the symmetry of E(t) around the value t = 0 (see Eq. (3.40)) and using this table, it
is possible to calculate the cumulative integral within any interval.

The integral (E.2) can also be obtained with the R instruction pnorm(t)-0.5
for any positive t; hence, the same can be done for all the values in the table. In
particular, t_{1−α/2} is found by looking for 1 − α/2 − 0.5 in the table and its matching
t value. For example, 1 − α = CL = 0.99 corresponds to 1 − α/2 = 0.995, so
1 − α/2 − 0.5 = 0.495 ≈ 0.4949 and the matching t is 2.57.
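
The table look-up of this example can be reproduced with the standard R functions pnorm and qnorm. In the short sketch below the helper name E is ours.

```r
# Integral (E.2) from pnorm, and the quantile for CL = 0.99 from qnorm.
E <- function(t) pnorm(t) - 0.5    # area between 0 and t
round(E(2.57), 4)                  # table entry matched in the example above

alpha <- 0.01
t_q <- qnorm(1 - alpha / 2)        # quantile t_{1 - alpha/2} without interpolation
round(t_q, 3)
```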

We recall that the functions of R which compute the fundamental densities 
tabulated in this Appendix are listed in Table B.2. 


© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
A. Rotondi et al., Probability, Statistics and Simulation, La Matematica per il 3+2
139, https://doi.org/10.1007/978-3-031-09429-3


E.2 Quantiles of the Student’s Density 


The quantile values t_P of the variable T, corresponding to different values of the
integral probability not to exceed a given value t_P:


P = P{T ≤ t_P} = ∫_{−∞}^{t_P} s_ν(t) dt ,    (E.3)


where s_ν(t) is the Student’s density (5.42), are reported in Table E.2 for different ν
values. This distribution has been obtained in Exercise 5.5.

The last row of the table, where ν = ∞, gives exactly the same quantile values
of the Gaussian density.

Quantile values corresponding to a cumulative probability p for df degrees of
freedom can also be obtained with the R instruction qt(p, df).


E.3 Integrals of the Reduced χ² Density


The reduced chi-square p.d.f. p_ν(χ²_R) of the variable Q(ν) with ν degrees of freedom
is given by Eq. (3.67).

The values of the variable Q_R(ν) = Q(ν)/ν, corresponding to different values
of the integral probability:


P = P{Q_R(ν) ≥ χ²_R} = ∫_{χ²_R}^{∞} p_ν(χ²_R) dχ²_R    (E.4)


to exceed a given value of reduced χ², are reported in Table E.3 for ν values between
0 and 100. When ν > 100, p_ν(χ²_R) tends to a Gaussian with mean and standard
deviation μ = 1, σ = √(2/ν), respectively.

The table values corresponding to a probability p for df degrees of freedom can
also be obtained with the R instruction qchisq(1-p,df)/df.


E.4 Quantile Values of the Non-Reduced x? Density 


Table E.4 gives the quantile values of the non-reduced $\chi^2$ density:

$$P = P\{Q(\nu) < \chi^2\} = \int_0^{\chi^2} p_\nu(\chi^2)\, d\chi^2 .$$


These values can be used for the $\chi^2$ test instead of Table E.3. The use of Table E.3
of reduced $\chi^2$ for given significance levels is equivalent to the use of Table E.4 of
the quantile values of non-reduced $\chi^2$. The choice of one or the other depends on
the preferences of the reader or on the type of problem.

When $\nu > 100$, the $\chi^2(\nu)$ density tends to a Gaussian with mean
and standard deviation given by $\mu = \nu$ and $\sigma = \sqrt{2\nu}$, respectively.

The quantile values of Table E.4 are to be preferred for the calculation of the
confidence levels corresponding to the value of $\Delta\chi^2$ in the equation:

$$\chi^2 = \chi^2_{\min} + \Delta\chi^2 , \tag{E.5}$$

where $\chi^2_{\min}$ is calculated with the true (or best fit) parameter values of the model. In
this case the degrees of freedom $\nu$ represent the number of free parameters of the $\chi^2$.

For example, if the $\chi^2$ has ten free parameters, the concentration ellipse that has
$\Delta\chi^2 \simeq 16$ as boundary contains 90% of the values, that is, it includes 90% of the
hyperspace, corresponding to CL = 0.90.

The quantile values corresponding to a probability p for df degrees of freedom
can also be obtained with the R instruction qchisq(p,df).
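
The example with ten free parameters can be checked with a short sketch such as the following, where n_par, cl and delta_chi2 are illustrative names introduced here.

```r
# Number of free parameters and requested confidence level
n_par <- 10
cl    <- 0.90

# Boundary value of Delta chi^2 for the requested confidence level
delta_chi2 <- qchisq(cl, df = n_par)
print(round(delta_chi2, 2))

# Conversely, the probability content of the region Delta chi^2 <= 16
print(round(pchisq(16, df = n_par), 3))
```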


E.5 Quantiles of the F Density 


The p.d.f. $p_{\nu_1 \nu_2}(F)$ of the Snedecor's F variable has been obtained in Exercise 5.6,
Eq. (5.46).

The probability to obtain a value $\{F < F_\alpha\}$ in a random sample with $\nu_1$ and $\nu_2$
degrees of freedom is given by the integral:

$$P\{F < F_\alpha\} = \int_0^{F_\alpha} p_{\nu_1 \nu_2}(F)\, dF . \tag{E.6}$$

The values of $F_\alpha$, corresponding to right-tailed significance levels of 5% ($\alpha = 0.95$)
and 1% ($\alpha = 0.99$), are shown in Tables E.5 and E.6, respectively, for different
values of $\nu_1$ and $\nu_2$. The left-tailed $F_\alpha$ values can be obtained from Eq. (5.49):

$$F_\alpha(\nu_1, \nu_2) = \frac{1}{F_{1-\alpha}(\nu_2, \nu_1)} . \tag{E.7}$$


The quantile values corresponding to a probability p for df1 and df2 degrees
of freedom can also be obtained with the R instruction qf(p,df1,df2).
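
A minimal sketch of this instruction, together with a numerical check of the reciprocal relation (E.7), is given below. The degrees of freedom 4 and 9 are arbitrary choices and the variable names are assumptions made here for illustration.

```r
# Right-tailed 5% level, i.e. alpha = 0.95 in the notation of Eq. (E.6)
alpha <- 0.95
nu1 <- 4
nu2 <- 9

# 95% quantile of F(nu1, nu2)
f_right <- qf(alpha, nu1, nu2)
print(round(f_right, 2))

# Left-tailed quantile computed directly and through Eq. (E.7)
f_left         <- qf(1 - alpha, nu1, nu2)
f_left_from_E7 <- 1 / qf(alpha, nu2, nu1)
print(c(direct = f_left, from_E7 = f_left_from_E7))
```

Note the exchange of the degrees of freedom in the second call to qf, as required by Eq. (E.7).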


Table E.1 Integral of the standard Gaussian g(t; 0, 1) as a function of the standard variable t

Rows: first decimal of t; columns: second decimal of t.

  t | 0.08   | 0.09
0.0 | 0.0319 | 0.0359
0.1 | 0.0714 | 0.0753
0.2 | 0.1103 | 0.1141
0.3 | 0.1480 | 0.1517
0.4 | 0.1844 | 0.1879
0.5 | 0.2190 | 0.2224
0.6 | 0.2517 | 0.2549
0.7 | 0.2823 | 0.2852
0.8 | 0.3106 | 0.3133
0.9 | 0.3365 | 0.3389
1.0 | 0.3599 | 0.3621
1.1 | 0.3810 | 0.3830
1.2 | 0.3997 | 0.4015
1.3 | 0.4162 | 0.4177
1.4 | 0.4306 | 0.4319
1.5 | 0.4429 | 0.4441
1.6 | 0.4535 | 0.4545
1.7 | 0.4625 | 0.4633
1.8 | 0.4699 | 0.4706
1.9 | 0.4761 | 0.4767
2.0 | 0.4812 | 0.4817
2.1 | 0.4854 | 0.4857
2.2 | 0.4887 | 0.4890
2.3 | 0.4913 | 0.4916
2.4 | 0.4934 | 0.4936
2.5 | 0.4951 | 0.4952
2.6 | 0.4963 | 0.4964
2.7 | 0.4973 | 0.4974
2.8 | 0.4980 | 0.4981
2.9 | 0.4986 | 0.4986
3.0 | 0.4990 | 0.4990
3.1 | 0.4993 | 0.4993
3.2 | 0.4995 | 0.4995
3.3 | 0.4996 | 0.4997
3.4 | 0.4997 | 0.4998
3.5 | 0.4998 | 0.4998
3.6 | 0.4999 | 0.4999
3.7 | 0.4999 | 0.4999
3.9 | 0.5000 | 0.5000

Table E.2 Quantile values $t_P$ of the Student's t variable for ν degrees of freedom

Rows: degrees of freedom ν; columns: cumulative probability P.

  ν | 0.60  | 0.70  | 0.75  | 0.80  | 0.85  | 0.90  | 0.95  | 0.975 | 0.99  | 0.995 | 0.9995
  1 |       |       |       |       |       |       |       |       |       | 63.65 | 636.6
  2 |       |       |       |       |       |       |       |       |       | 9.925 | 31.60
  3 | 0.277 | 0.584 | 0.765 | 0.978 | 1.250 | 1.638 | 2.353 | 3.182 | 4.541 | 5.841 | 12.92
  4 |       |       |       | 0.941 | 1.190 | 1.533 | 2.132 | 2.776 | 3.747 | 4.604 | 8.610
  5 | 0.267 | 0.559 | 0.727 | 0.920 | 1.156 | 1.476 | 2.015 | 2.571 | 3.365 | 4.032 | 6.869
  6 | 0.265 | 0.553 | 0.718 | 0.906 | 1.134 | 1.440 |       |       |       | 3.707 | 5.959
  7 |       |       | 0.711 |       |       |       |       |       |       | 3.499 | 5.408
  8 |       |       | 0.706 | 0.889 | 1.108 | 1.397 | 1.860 | 2.306 | 2.896 | 3.355 | 5.041
  9 | 0.261 | 0.543 | 0.703 | 0.883 | 1.100 | 1.383 | 1.833 |       |       | 3.250 | 4.781
 10 | 0.260 | 0.542 | 0.700 |       |       |       |       |       |       | 3.169 | 4.587
 11 | 0.260 | 0.540 | 0.697 | 0.876 | 1.088 | 1.363 | 1.796 | 2.201 | 2.718 | 3.106 | 4.437
 12 | 0.259 | 0.539 |       |       |       |       | 1.782 |       |       | 3.055 | 4.318
 13 | 0.259 | 0.538 |       |       |       |       | 1.771 |       |       | 3.012 | 4.221
 14 | 0.258 | 0.537 | 0.692 | 0.868 | 1.076 | 1.345 | 1.761 |       |       | 2.977 | 4.140
 15 | 0.258 | 0.536 |       |       |       |       | 1.753 |       |       | 2.947 | 4.073
 16 | 0.258 | 0.535 | 0.690 | 0.865 | 1.071 | 1.337 | 1.746 |       |       | 2.921 | 4.015
 17 | 0.257 | 0.534 |       |       |       |       | 1.740 | 2.110 | 2.567 | 2.898 | 3.965
 18 | 0.257 | 0.534 |       |       |       |       | 1.734 | 2.101 | 2.552 | 2.878 | 3.922
 19 | 0.257 | 0.533 | 0.688 | 0.861 | 1.066 | 1.328 | 1.729 | 2.093 | 2.539 | 2.861 | 3.883
 20 | 0.257 | 0.533 |       |       |       |       | 1.725 |       |       | 2.845 | 3.849
 21 | 0.257 | 0.532 | 0.686 | 0.859 | 1.063 | 1.323 | 1.721 |       |       | 2.831 | 3.819
 22 | 0.256 | 0.532 | 0.686 | 0.858 | 1.061 | 1.321 | 1.717 |       |       | 2.819 | 3.792
 23 | 0.256 | 0.532 | 0.685 | 0.858 | 1.060 | 1.319 | 1.714 |       |       | 2.807 | 3.768
 24 | 0.256 | 0.531 | 0.685 | 0.857 | 1.059 | 1.318 | 1.711 |       |       | 2.797 | 3.745
 25 | 0.256 | 0.531 |       |       |       |       | 1.708 |       |       | 2.787 | 3.725
 26 | 0.256 | 0.531 |       |       |       |       | 1.706 |       |       | 2.779 | 3.707
 27 | 0.256 | 0.531 | 0.684 | 0.855 | 1.057 | 1.314 | 1.703 | 2.052 | 2.473 | 2.771 | 3.690
 28 | 0.256 | 0.530 |       |       |       |       | 1.701 |       |       | 2.763 | 3.674
 29 | 0.256 | 0.530 | 0.683 | 0.854 | 1.055 | 1.311 | 1.699 |       |       | 2.756 | 3.659
 30 | 0.256 | 0.530 |       |       |       |       |       |       |       | 2.750 | 3.646
 40 |       |       |       |       |       |       |       |       |       | 2.704 | 3.551
 50 | 0.255 | 0.528 | 0.679 | 0.849 | 1.047 | 1.299 | 1.676 |       |       | 2.678 | 3.496
 60 | 0.254 | 0.527 | 0.679 | 0.848 | 1.045 | 1.296 |       |       |       | 2.660 | 3.460
 70 | 0.254 | 0.527 | 0.678 | 0.847 | 1.044 | 1.294 | 1.667 |       |       | 2.648 | 3.435
 80 | 0.254 | 0.526 | 0.678 | 0.846 | 1.043 | 1.292 |       |       |       | 2.639 | 3.416
 90 | 0.254 | 0.526 | 0.677 | 0.846 | 1.042 | 1.291 | 1.662 | 1.987 | 2.368 | 2.632 | 3.402
100 |       |       |       |       |       |       |       | 1.984 | 2.364 | 2.626 | 3.390
  ∞ |       |       |       |       |       |       |       | 1.960 | 2.326 | 2.576 | 3.291


Table E.3 Values of the χ²/ν variable having a probability P to be exceeded in a sampling

Rows: degrees of freedom ν; columns: probability P to be exceeded.

  ν | 0.005 | 0.01  | 0.025 | 0.05  | 0.10  | 0.25  | 0.50  | 0.75  | 0.90  | 0.95  | 0.975 | 0.99  | 0.995
  1 | 7.88  |       | 5.02  |       |       |       |       |       |       |       |       | 0.000 | 0.000
  2 |       |       | 3.69  |       |       |       |       |       |       |       |       | 0.01  | 0.005
  3 | 4.28  | 3.78  | 3.12  | 2.61  |       |       |       |       |       |       |       | 0.04  | 0.02
  4 | 3.71  |       |       | 2.37  |       |       |       |       |       |       |       | 0.07  | 0.05
  5 | 3.35  | 3.02  | 2.57  | 2.21  |       |       |       |       |       |       |       | 0.11  | 0.08
  6 | 3.09  |       | 2.41  | 2.10  |       |       |       |       |       |       |       | 0.15  | 0.11
  7 | 2.90  |       | 2.29  | 2.01  |       |       |       |       |       |       |       | 0.18  | 0.14
  8 | 2.74  |       | 2.19  |       |       |       |       |       |       | 0.34  | 0.27  | 0.21  | 0.17
  9 |       |       | 2.11  |       |       |       |       |       |       |       |       | 0.23  | 0.19
 10 |       |       |       |       |       |       |       |       |       |       |       | 0.26  | 0.22
 11 |       |       |       |       |       |       |       |       |       |       |       | 0.28  | 0.24
 12 |       |       |       |       |       |       |       |       |       |       |       | 0.30  | 0.26
 13 |       |       |       |       |       |       |       |       |       |       |       | 0.32  | 0.27
 14 |       |       |       |       |       |       |       |       |       |       |       | 0.33  | 0.29
 15 |       |       |       |       |       |       |       |       |       |       |       | 0.35  | 0.31
 16 |       |       |       |       |       | 1.21  | 0.96  | 0.74  | 0.58  | 0.50  | 0.43  | 0.36  | 0.32
 17 |       |       |       |       |       | 1.21  | 0.96  | 0.75  | 0.59  | 0.51  | 0.44  | 0.38  | 0.34
 18 |       |       |       |       |       |       |       |       |       |       | 0.46  | 0.39  | 0.35
 19 | 2.03  |       | 1.73  | 1.59  | 1.43  | 1.20  | 0.97  | 0.77  | 0.61  | 0.53  | 0.47  | 0.40  | 0.36
 20 | 2.00  |       | 1.71  | 1.57  | 1.42  | 1.19  | 0.97  | 0.77  | 0.62  | 0.54  | 0.48  | 0.41  | 0.37
 21 |       |       |       |       | 1.41  |       |       |       |       |       |       | 0.42  | 0.38
 22 |       |       |       |       |       |       |       |       |       |       |       | 0.43  | 0.39
 23 |       |       |       |       |       |       |       |       |       |       |       | 0.44  | 0.40
 24 |       |       |       |       |       |       |       |       |       |       |       | 0.45  | 0.41
 25 |       |       |       |       |       | 1.17  | 0.97  |       |       |       |       | 0.46  | 0.42
 26 |       |       |       |       |       |       |       |       |       |       |       | 0.47  | 0.43
 27 |       |       |       |       |       | 1.17  | 0.98  | 0.81  | 0.67  | 0.60  | 0.54  | 0.48  | 0.44
 28 |       |       |       |       |       | 1.17  | 0.98  | 0.81  | 0.68  | 0.60  | 0.55  | 0.48  | 0.45
 29 |       |       | 1.58  |       |       |       |       |       |       |       |       | 0.49  | 0.45
 30 |       | 1.70  | 1.57  |       |       |       |       |       |       |       |       | 0.50  | 0.46
 40 |       |       |       |       |       | 1.14  | 0.98  | 0.84  | 0.73  | 0.66  | 0.61  | 0.55  | 0.52
 50 |       |       |       |       |       | 1.13  | 0.99  | 0.86  | 0.75  | 0.70  | 0.65  | 0.59  | 0.56
 70 |       |       |       |       |       | 1.11  | 0.99  | 0.88  | 0.79  | 0.74  | 0.70  | 0.65  | 0.62
 80 |       |       |       |       | 1.21  | 1.10  |       |       |       |       |       |       |
 90 |       |       |       |       | 1.20  | 1.10  | 0.99  | 0.90  | 0.81  | 0.77  | 0.73  | 0.69  | 0.66
100 | 1.40  | 1.36  | 1.30  | 1.24  | 1.18  | 1.09  | 0.99  | 0.90  | 0.82  | 0.78  | 0.74  | 0.70  | 0.67


Table E.4 Quantile values χ² of the non-reduced χ² variable for ν degrees of freedom

Rows: degrees of freedom ν; columns: cumulative probability P.

  ν | 0.20   | 0.30   | 0.40   | 0.50   | 0.60   | 0.70   | 0.80   | 0.90   | 0.95   | 0.975  | 0.99   | 0.995
  1 | 0.06   | 0.15   | 0.27   | 0.45   | 0.71   | 1.07   | 1.64   | 2.71   | 3.84   | 5.02   | 6.63   | 7.88
  2 |        |        |        |        |        |        |        |        |        |        |        | 10.60
  3 | 1.00   | 1.42   | 1.87   | 2.37   | 2.95   | 3.67   | 4.64   | 6.25   | 7.82   | 9.35   | 11.34  | 12.84
  4 | 1.65   | 2.19   | 2.75   | 3.36   | 4.04   | 4.88   | 5.99   | 7.78   | 9.49   | 11.14  | 13.28  | 14.86
  5 | 2.34   | 3.00   | 3.66   | 4.35   | 5.13   | 6.06   | 7.29   | 9.24   | 11.07  | 12.83  | 15.09  | 16.75
  6 |        |        |        |        |        |        |        |        |        |        |        | 18.55
  7 |        |        |        |        |        |        |        |        |        |        |        | 20.28
  8 |        |        |        |        |        |        | 11.03  | 13.36  | 15.51  | 17.53  | 20.09  | 21.95
  9 | 5.38   | 6.39   | 7.36   | 8.34   | 9.41   | 10.66  | 12.24  | 14.68  | 16.92  | 19.02  | 21.67  | 23.59
 10 | 6.18   | 7.27   | 8.30   | 9.34   | 10.47  | 11.78  | 13.44  | 15.99  | 18.31  | 20.48  | 23.21  | 25.19
 11 | 6.99   | 8.15   | 9.24   | 10.34  | 11.53  | 12.90  | 14.63  | 17.27  | 19.68  | 21.92  | 24.72  | 26.76
 12 | 7.81   | 9.03   | 10.18  | 11.34  | 12.58  | 14.01  | 15.81  | 18.55  | 21.03  | 23.34  | 26.22  | 28.30
 13 | 8.63   | 9.93   | 11.13  | 12.34  | 13.64  | 15.12  | 16.98  | 19.81  | 22.36  | 24.74  | 27.69  | 29.82
 14 | 9.47   | 10.82  | 12.08  | 13.34  | 14.69  | 16.22  | 18.15  | 21.06  | 23.68  | 26.12  | 29.14  | 31.32
 15 | 10.31  | 11.72  | 13.03  | 14.34  | 15.73  | 17.32  | 19.31  | 22.31  | 25.00  | 27.49  | 30.58  | 32.80
 16 |        |        |        | 15.34  | 16.78  | 18.42  | 20.47  | 23.54  | 26.30  | 28.85  | 32.00  | 34.27
 17 | 12.00  | 13.53  | 14.94  | 16.34  | 17.82  | 19.51  | 21.61  | 24.77  | 27.59  | 30.19  | 33.41  | 35.72
 18 | 12.86  | 14.44  | 15.89  | 17.34  | 18.87  | 20.60  | 22.76  | 25.99  | 28.87  | 31.53  | 34.81  | 37.16
 19 | 13.72  | 15.35  | 16.85  | 18.34  | 19.91  | 21.69  | 23.90  | 27.20  | 30.14  | 32.85  | 36.19  | 38.58
 20 | 14.58  | 16.27  | 17.81  | 19.34  | 20.95  | 22.77  | 25.04  | 28.41  | 31.41  | 34.17  | 37.57  | 40.00
 21 | 15.44  | 17.18  | 18.77  | 20.34  | 21.99  | 23.86  | 26.17  | 29.62  | 32.67  |        | 38.93  | 41.40
 22 |        |        |        | 21.34  | 23.03  | 24.94  | 27.30  | 30.81  | 33.92  |        |        | 42.80
 23 | 17.19  | 19.02  | 20.69  | 22.34  | 24.07  | 26.02  | 28.43  | 32.01  | 35.17  | 38.08  | 41.64  | 44.18
 24 | 18.06  | 19.94  | 21.65  | 23.34  | 25.11  | 27.10  | 29.55  | 33.20  | 36.42  | 39.36  | 42.98  | 45.56
 25 | 18.94  | 20.87  | 22.62  | 24.34  | 26.14  | 28.17  | 30.68  | 34.38  | 37.65  | 40.65  | 44.31  | 46.93
 26 | 19.82  | 21.79  | 23.58  | 25.34  | 27.18  | 29.25  | 31.79  | 35.56  | 38.89  | 41.92  | 45.64  | 48.29
 27 |        |        |        |        |        |        |        |        |        |        |        | 49.64
 28 | 21.59  | 23.65  | 25.51  | 27.34  | 29.25  | 31.39  | 34.03  | 37.92  | 41.34  | 44.46  | 48.28  | 50.99
 29 |        |        |        |        |        |        |        |        |        | 45.72  | 49.59  | 52.34
 30 |        | 25.51  | 27.44  | 29.34  | 31.32  |        |        |        | 43.77  | 46.98  | 50.89  | 53.67
 40 | 32.34  | 34.87  | 37.13  | 39.34  | 41.62  | 44.16  | 47.27  | 51.81  | 55.76  | 59.34  | 63.69  | 66.77
 50 |        |        |        | 49.33  |        |        | 58.16  | 63.17  | 67.50  | 71.42  | 76.15  | 79.49
 60 | 50.64  | 53.81  | 56.62  | 59.33  | 62.13  | 65.23  | 68.97  | 74.40  | 79.08  | 83.30  | 88.38  | 91.95
 70 |        | 63.35  | 66.40  | 69.33  | 72.36  | 75.69  | 79.71  | 85.53  | 90.53  | 95.02  | 100.4  | 104.2
 80 | 69.21  | 72.92  | 76.19  | 79.33  | 82.57  | 86.12  | 90.41  | 96.58  | 101.88 | 106.6  | 112.3  | 116.3
 90 | 78.56  | 82.51  | 85.99  | 89.33  | 92.76  | 96.52  | 101.1  | 107.6  | 113.2  | 118.1  | 124.1  | 128.3
100 |        |        |        |        |        |        |        |        |        |        |        | 140.2


Table E.5 95% quantile values of the Snedecor's F(ν1, ν2) variable

Rows: ν2; columns: ν1.

 ν2 | 1     | 2     | 3     | 4     | 6     | 8     | 10    | 15    | 20    | 30    | 40    | 60    | 120   | ∞
  1 |       |       |       |       |       |       |       |       |       |       |       |       | 253   | 254
  2 |       |       |       |       |       |       |       |       |       |       |       |       |       | 19.5
  3 |       |       | 9.28  |       |       |       |       |       |       |       |       |       |       | 8.53
  4 |       | 6.94  | 6.59  |       |       |       |       |       |       |       |       |       |       | 5.63
  5 |       | 5.79  |       |       |       |       |       |       |       |       |       |       |       | 4.37
  6 | 5.99  | 5.14  | 4.76  | 4.53  |       |       |       |       |       |       |       |       |       | 3.67
  7 | 5.59  | 4.74  | 4.35  |       |       |       |       |       |       |       |       |       |       | 3.23
  8 |       |       | 4.07  |       |       |       |       |       |       |       |       |       |       | 2.93
  9 | 5.12  | 4.26  |       | 3.63  |       |       |       |       |       |       |       |       |       | 2.71
 10 |       |       |       |       |       |       |       |       |       |       |       |       |       | 2.54
 11 | 4.84  | 3.98  |       |       |       |       |       |       |       |       |       |       |       | 2.40
 12 |       |       |       |       |       |       |       |       |       |       |       |       |       | 2.30
 13 |       |       |       |       |       |       |       |       |       |       |       |       |       | 2.21
 14 |       |       |       |       |       |       |       |       |       |       |       |       |       | 2.13
 15 |       |       |       |       |       |       |       |       |       |       |       |       |       | 2.07
 16 | 4.49  | 3.63  | 3.24  | 3.01  |       |       |       |       |       |       |       |       |       | 2.01
 17 |       | 3.59  |       |       |       |       |       |       |       |       |       |       |       | 1.96
 18 |       |       | 3.16  | 2.93  |       |       |       |       |       |       |       |       |       | 1.92
 19 | 4.38  |       |       |       |       |       |       |       |       |       |       |       |       | 1.88
 20 |       |       |       | 2.87  |       |       |       |       |       |       |       |       |       | 1.84
 21 |       |       | 3.07  | 2.84  | 2.57  | 2.42  | 2.32  | 2.18  | 2.10  | 2.01  | 1.96  | 1.92  | 1.87  | 1.81
 22 | 4.30  | 3.44  | 3.05  | 2.82  | 2.55  |       |       |       |       |       |       |       |       | 1.78
 23 | 4.28  | 3.42  | 3.03  |       |       |       |       |       |       |       |       |       |       | 1.76
 24 | 4.26  |       | 3.01  | 2.78  | 2.51  | 2.36  | 2.25  | 2.11  |       |       |       |       |       |
 25 | 4.24  |       |       | 2.76  |       |       |       |       |       |       |       |       |       | 1.71
 26 | 4.23  |       |       |       | 2.47  | 2.32  | 2.22  | 2.07  | 1.99  | 1.90  | 1.85  | 1.80  | 1.75  |
 27 | 4.21  | 3.35  | 2.96  | 2.73  | 2.46  | 2.31  | 2.20  | 2.06  | 1.97  | 1.88  | 1.84  | 1.79  | 1.73  | 1.67
 28 |       |       |       |       |       |       |       |       |       |       |       |       |       | 1.65
 29 |       |       |       |       |       |       |       |       |       |       |       |       |       | 1.64
 30 |       |       |       | 2.69  |       |       |       |       |       |       |       |       |       | 1.62
 40 |       |       |       |       |       |       |       |       |       |       |       |       |       | 1.51
 60 | 4.00  | 3.15  | 2.76  | 2.53  |       |       |       |       |       |       |       |       |       | 1.39
120 |       |       |       |       |       |       |       |       |       |       |       |       |       | 1.25
  ∞ |       |       |       | 2.37  |       |       |       |       |       |       |       |       |       | 1.00


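Table E.6 99% quantile values of the Snedecor's F(ν1, ν2) variable

Rows: ν2; columns: ν1.

 ν2 | 1     | 2     | 3     | 4     | 6     | 8     | 10    | 15    | 20    | 30    | 40    | 60    | 120   | ∞
  1 |       |       |       |       |       |       |       |       |       |       |       |       |       | 6366
  2 |       |       |       |       |       |       |       |       |       |       |       |       |       | 99.5
  3 |       |       |       |       |       |       |       |       |       |       |       |       |       | 26.1
  4 |       |       |       |       |       |       |       |       |       |       |       |       |       | 13.5
  5 |       | 13.3  |       |       |       |       |       |       |       |       |       |       |       | 9.02
  6 |       |       |       |       |       |       |       |       |       |       |       |       |       | 6.88
  7 |       | 9.55  | 8.45  | 7.85  | 7.19  | 6.84  | 6.62  | 6.31  | 6.16  | 5.99  | 5.91  | 5.82  | 5.74  | 5.65
  8 |       |       |       | 7.01  |       | 6.03  |       |       |       |       |       |       |       | 4.86
  9 |       |       | 6.99  |       |       |       | 5.26  |       |       |       |       |       |       | 4.31
 10 |       |       |       |       |       |       |       | 4.56  |       |       |       |       |       | 3.91
 11 |       |       |       |       |       |       |       |       |       |       |       |       |       | 3.60
 12 |       |       |       |       |       |       |       |       |       |       |       |       |       | 3.36
 13 |       |       |       |       |       |       |       | 3.82  |       |       |       |       |       | 3.17
 14 |       |       |       |       |       |       |       |       |       |       |       |       |       | 3.00
 15 |       |       |       |       |       | 4.00  | 3.81  | 3.52  | 3.37  |       |       |       |       | 2.87
 16 |       |       |       |       |       | 3.89  | 3.69  | 3.41  | 3.26  |       |       |       |       | 2.75
 17 |       |       |       |       |       | 3.79  |       |       | 3.16  | 3.00  | 2.92  |       |       | 2.65
 18 |       |       |       |       |       | 3.71  | 3.51  | 3.23  | 3.08  | 2.92  | 2.84  |       |       | 2.57
 19 |       |       |       |       |       |       | 3.43  | 3.15  | 3.00  | 2.84  | 2.76  |       |       | 2.49
 20 |       |       |       |       |       |       |       | 3.09  | 2.94  | 2.78  |       |       |       | 2.42
 21 |       |       |       |       |       |       |       |       |       |       |       |       |       | 2.36
 22 |       |       |       |       |       |       |       |       |       |       |       |       |       | 2.31
 23 |       |       |       |       |       |       |       |       |       |       |       |       |       | 2.26
 24 |       |       |       |       |       |       |       |       |       |       |       |       |       | 2.21
 25 |       |       |       |       |       |       |       |       |       |       |       |       | 2.27  | 2.17
 26 |       |       |       |       |       |       |       |       |       |       |       |       | 2.23  | 2.13
 27 |       |       |       |       |       |       |       |       |       |       |       |       |       | 2.10
 28 |       |       |       |       |       |       |       |       |       |       |       |       |       | 2.06
 29 |       |       |       |       |       |       |       |       |       |       |       |       |       | 2.03
 30 |       |       |       |       |       |       | 2.98  | 2.70  | 2.55  | 2.39  | 2.30  | 2.21  | 2.11  | 2.01
 40 |       |       |       |       |       |       | 2.80  | 2.52  | 2.37  | 2.20  | 2.11  | 2.02  | 1.92  | 1.80
 60 |       |       |       |       |       |       |       |       |       |       |       |       | 1.73  | 1.60
120 | 6.85  | 4.79  | 3.95  | 3.48  | 2.96  | 2.66  | 2.47  | 2.19  | 2.03  | 1.86  | 1.76  | 1.66  | 1.53  | 1.38
  ∞ |       |       |       |       |       | 2.51  | 2.32  |       | 1.88  | 1.70  | 1.59  | 1.47  | 1.32  | 1.00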


Bibliography 


[At 03] S. Agostinelli et al. Geant4-a simulation toolkit. Nuclear Instruments and Methods A, 
506:250-303, 2003. 
[AAN*07] V.S. Anishchenko, V. Astakov, A. Neiman, T. Vadivasova, and L. Schimansky-Geier. 
Non linear Dynamics of Chaotic and Stochastic Systems. Springer, 2007. 
[Azz96] A. Azzalini. Statistical Inference Based on the Likelihood. Chapman & Hall/CRC, Boca 
Raton, Florida (USA), 1996. 

[Bar89] R. Barlow. Statistics. J. Wiley and Sons, New York, 1989. 

[BCDO1] L.D. Brown, T.T. Cai, and A. DasGupta. Interval Estimation for a Binomial Proportion. 
Statistical Science, 16:101—133, 2001. 

[BFS87] P. Bratley, B.L. Fox, and L.E Schrage. A Guide to Simulation. Academic Press, New 
York, second edition, 1987. 

[BH95] Y. Benjamini and Y. Hochberg. Controlling the False Discovery Rate: a Practical and 
Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society B, 
57:289-300, 1995. 

[Blo84] V. Blobel. Unfolding Methods in High-Energy Physics Experiments. Technical Report 
DESY 84-118, DESY, Amburgo, 1984. 

[BR92] P.R. Bevington and D.K. Robinson. Data Reduction and Error Analysis for the Physical 
Sciences. McGraw-Hill, New York, 1992. 

[Brol3] W.A. Brown. The Placeo Effect in Clinical Practice. Oxford University Press, Oxford, 
2013. 

[BS91] M. Berblinger and C. Schlier. Monte Carlo integration with quasi-random numbers: 
some experience. Computer Physics Communications, 66:157—166, 1991. 

[Buc84] S.T. Buckland. Monte Carlo Confidence Intervals. Biometrics, 40:811-817, 1984. 

[Bun86] B.D. Bunday. Basic Queueing Theory. Edward Arnold, London, 1986. 

[Car85] J. Carlson. A double blind test for astrology. Nature, 318:419-425, 1985. 

[CB90] G. Casella and R.L. Berger. Statistical Inference. Wadsworth & Brook-Cole, Pacific 
Grove, 1990. 

[Cha75] G.J. Chaitin. Randomness and Matemathical Proof. Scientific American, 232:47-53, 
May 1975. 

[Coc77] W.G. Cochran. Sampling Techniques. J.Wiley and Sons, New York, third edition, 1977. 

[Col83a] UA1 Collaboration. Experimental Observations of Lepton Pairs of Invariant Mass 
around 95 GeV/c at the CERN SPS Collider. Physics Letters B, 126:398-410, 1983. 

[Col83b] UA2 Collaboration. Evidence of Z° —> e+ e~ at the CERN Collider. Physics Letters B, 
129:130-140, 1983. 


© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 625 
A. Rotondi et al., Probability, Statistics and Simulation, La Matematica per i] 3+2 
139, https://doi.org/10.1007/978-3-03 1-09429-3 


626 Bibliography 


[Com91] A. Compagner. Definitions of randomness. American Journal of Physics, 59:700-705, 
1991. 

[Cou95] R.D. Cousins. Why isn’t every physicist a bayesian? American Journal of Physics, 
63:398-410, 1995. 

[Cra51] H. Cramer. Mathematical Methods of Statistics. Princeton University Press, Princeton, 
1951. 

[CS61] D.R. Cox and W.C. Smith. Queues. Chapman and Hall, London, 1961. 

[D’A94] G. D’ Agostini. On the use of the covariance matrix to fit the correlated data. Nuclear 

Instruments and Methods in Physics Research A, 346:306-311, 1994. 

[D’A99] G. D’ Agostini. Bayesian Reasoning in High-Energy Physics: Principles and Applica- 

tions. Technical Report CERN 99-03, CERN, Ginevra, 1999. 

[Dav08] A.C. Davison. Statistical Models. Cambridge University Press, New York, 2008. 

[DE83] P. Diaconis and B. Efron. Computer-Intensive Methods in Statistics. Scientific Ameri- 

can, 248(5):116-130, 1983. 

[DF33] B. De Finetti. Sul concetto di probabilita. Rivista Italiana di Statistica, Economia e 

Finanza, V-4:723-747, 1933. English translation of the paper published in: Statistica, 

LXXVI-3, 2016. 

[DH99] A.C. Davison and D.V. Hinkley. Bootstrap Methods and their Applications. Cambridge 

University Press, Cambridge, 1999. 
[Efr79] B. Efron. Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics, 


7:1-26, 1979. 
[Efr82] B. Efron. The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, Filadelfia, 
1982. 


[ET93] B. Efron and R.J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, 
London, 1993. 

[Eva55] R.D. Evans. The Atomic Nucleus. Mc Graw-Hill, New York, 1955. 

[Ft 18] L. Falcioni et al. Report of final results regarding brain and heart tumors in Sprague- 
Dawley rats exposed from prenatal life until natural death to mobile phone radiofre- 
quency field representative of a 1.8 GHz GSM base station environmental emission. 
Environmental Research, 165:496—503, 2018. 

[FC98] G.J. Feldman and R.D. Cousins. A unified approach to the classic statistical analysis of 
small signals. Physical Review D, 57:5873-5889, 1998. 

[Fel47] W. Feller. An Introduction to Probability Theory and Its Applications, volume 1. John 
Wiley and Sons, New York, second edition, 1947. 

[Fey18] R.P. Feynman. Surely You’re Joking, Mr. Feynman!: Adventures of a Curious Charac- 
ter. Norton & Co, London, 2018. 

[Fis41] R.A. Fisher. Statistical Methods for Research Workers. Eighth, London, 1941. 

[Fis96] G.S. Fishman. Monte Carlo Concepts, Algorithms, and Applications. Springer-Verlag, 
New York, 1996. 

[FJ10] M.J. Flegal and G.L. Jones. Batch means and spectral variance estimators in Markov 
chain Monte Carlo. The Annals of Statistics, 38:1034-1070, 2010. 

[FLJW92] A.M. Ferrenberg, D.P. Landau, and Y. Joanna Wong. Monte Carlo Simulation: hidden 
errors from “good” random numbers generators. Physical Review Letters, 69:3382- 
3384, 1992. 

[Fra84] A. Franklin. Forging, cooking, trimming, and riding on the bandwagon. American 
Journal of Physics, 52:786—793, 1984. 

[Fra97] A. Franklin. Millikan’s Oil-Drop Experiments. The Chemical Educator, 2:1-14, 1997. 

[S193] International Organization for Standardization (ISO). Guide to the expression of 
uncertainty in measurement. Technical report, ISO, Ginevra, 1993. 

[Gne76] B.V. Gnedenko. The Theory of Probability. Mir, Moscow, 1976. 

[Gre06] P. Gregory. Bayesian Logical Data Analysis for the Physical Sciences. Cambridge 
University Press, Cambridge, 2006. 

[GS92] G.R. Grimmet and D.R. Stirzaker. Probability and Random Processes. Clarendon Press, 
Oxford, 1992. 


Bibliography 627 


[Has70] W.K. Hastings. Monte Carlo samplings methods using Markov chains and their 
applications. Biometrika, 57:97-109, 1970. 
[HH64] J.M. Hammersley and D.C. Handscomb. Monte Carlo Methods. Chapman and Hall, 
London, 1964. 
[Hix76] J.R. Hixson. The Patchwork Mouse. Anchor Press, New York, 1976. 
[HL07] J. Heinrich and L. Lyons. Systematic errors. Annual Review of Nuclear and Particle 
Science, 57:145-169, 2007. 
[Jam80] F. James. Monte Carlo theory and practice. Reports on Progress in Physics, 43:1145- 
1189, 1980. 
[Jam90] F. James. A review of pseudorandom number generators. Computer Physics Communi- 
cations, 60:329-344, 1990. 
[Jam92] F. James. Minuit Reference Manual. Technical Report CERN D506, CERN, Geneva, 
1992. 
[Jam08] F. James. Statistical Methods in Experimental Physics. World Scientific, London, 2008. 
[Jen06] M. Jeng. A selected history of expectation bias in physics. American Journal of Physics, 
74:578-583, 2006. 
[JLPe00] F. James, L. Lyons, and Y. Perrin (editors). Proceedings of “Workshop on confidence 
limits”. Technical Report CERN 2000-005, CERN, Geneva, 2000. 
[Kah56] H. Kahn. Use of different Monte Carlo sampling techniques. In H.A. Meyer, editor, 
Symposium on Monte Carlo Methods, pages 146-190. J.Wiley and Sons, New York, 
1956. 
[KH96] K. Kacperski and A. Holyst. Phase Transition and Hysteresis in a Cellular Automata- 
Based Model of Opinion Formation. Journal of Statistical Physics, 84:168—189, 1996. 
[Knu81] D.E. Knuth. The Art of Computer Programming, volume 2: Seminumerical Algorithms, 
chapter 2. Addison-Wesley, Reading, second edition, 1981. 
[Kol33] A. Kolmogorov. Sulla determinazione empirica di una legge di distributione. Giornale 
dell’ Istituto Italiano degli Attuari, 4:83-91, 1933. English translation of the paper 
published in: A. N. Shiryayev (ed.), Selected Works of A. N. Kolmogorov, volume II, 
139-146, Springer, 1992. 
[KR88] B.W. Kernighan and D.M. Ritchie. The C Programming Language. Pearson Education, 
USA, 1988. 
[KS73] M. G. Kendall and A. Stuart. The Advanced Theory of Statistics, volume II. Griffin, 
London, 1973. 
[KW86] M.H. Kalos and P. Whitlock. Monte Carlo Methods, volume One: Basics. J.Wiley and 
Sons, New York, 1986. 
[Lam66] J.R. Lamarsh. Nuclear Reactor Theory. Addison-Wesley, Reading (Mass.), 1966. 
[Lev60] H. Levene. Robust Tests for Equality of Variances. Contributions to Probability and 
Statistics: Essays in Honor of Harold Hotelling. Stanford University Press, 1960. 
[LSHt 90] N. Lenssen, G. Schmidt, J. Hansen, M. Menne, A. Persin, R. Ruedy, and D. Zyss. 
Improvements in the GISTEMP uncertainty model. Journal of Geophysical Research: 
Atmospheres, 124:6307-6326, 1990. 
[LW65] T.D. Lee and C.S. Wu. Weak interactions. Annual Review of Nuclear Science, 15:381- 
476, 1965. 
[LW18] L. Lyons and N. Wardle. Statistical issues in searches for new phenomena in High 
Energy Physics. Journal of Physics G; Nuclear and Particle Physics, 45(3):033001, 
2018. 
[Lyb84] M. Lybanon. Comment on “Least squares when both variables have uncertainties”. 
American Journal of Physics, 52:276—278, 1984. 
[Lyo13] L. Lyons. Discovering the Significance of 5 sigma. Eprint: https://doi.org/10.48550/ 
arXiv.1310.1284, 2013. 
[MGB73] A.M. Mood, F.A. Graybill, and D.C. Boes. Introduction to the Theory of Statistics. 
McGraw-Hill, New York, 1973. 
[MNZ90] G. Marsaglia, B. Narashimhan, and A. Zaman. A random number generator for PC’s. 
Computer Physics Communications, 60:345-349, 1990. 


628 Bibliography 


[Mon03] D. C. Montgomery. Design and analysis of experiments. John Wiley & sons, New York, 
2003. 
[Mor84] B.J.T. Morgan. Elements of Simulations. Chapman and Hall, London, 1984. 
[MRR*53] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. 
Equations of State Calculations by Fast Computing Machines. the Journal of Chemical 
Physics, 21:1087—1092, 1953. 
[Ney37] J. Neyman. Outline of a Theory of Statistical Estimation Based on the Classical Theory 
of Probability. Phil. Trans. Royal Society London, 236:333-380, 1937. 
[Ons44] L. Onsager. A Two Dimensional Model with an Order-Disorder Transition. Physical 
Review, 65:117-149, 1944. 
[Ore82] J. Orear. Least squares when both variables have uncertainties. American Journal of 
Physics, 50:912-916, 1982. 
[PFTW92] W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Wetterling. Numerical Recipes: 
The Art of Scientific Computing. Cambridge University Press, Cambridge, second 
edition, 1992. 
[Pop59] J. R. Popper. The Logic of Scientific Discovery. Hutchinson & Co, London, 1959. 

[PS20] P. Pedroni and S. Sconfietti. A new Monte Carlo-based fitting method. Journal of 
Physics G; Nuclear and Particle Physics, 47(5):05401, 2020. 

[PUP02] A. Papoulis and S. Unnikrishna Pillai. Probability, Random Variables and Stochastic 
Processes. McGraw Hill, Europe, 2002. 
[RC99] C. P. Robert and G. Casella. Monte Carlo statistical methods. Springer Verlag, New 
York, 1999. 
[Rip86] B. Ripley. Stochastic Simulation. J.Wiley and Sons, New York, 1986. 
[Ros96] S.B. Ross. Simulation. Academic Press, London, second edition, 1996. 
[Rotl0] A. Rotondi. On frequency and efficiency measurements in counting experiments. 
Nuclear Instruments and Methods in Physics Research A, 614:106-119, 2010. 
[RPP] A. Rotondi, P. Pedroni, and A. Pievatolo. Text web site: https://tinyurl.com/ 
ProbStatSimul. 

[RS17] C. Rothleitner and S. Schlamminger. Invited Review Article: Measurements of the 
Newtonian constant of gravitation, G. Review of Scientific Instruments, 88:111101.25— 
111101.28, 2017. 

[Rub81] R.Y. Rubinstein. Simulation and Monte Carlo Method. J. Wiley and Sons, New York, 
1981. 

[Rue96] D. Ruelle. Chance and Caos. Princeton Science Library, Princeton, 1996. 

[RvN63] R.D. Richmeyer and J. von Neumann. Statistical methods in neutron diffusion. In A.H. 
Taub, editor, John von Neumann collected works, volume V, pages 751-767. Pergamon 
Press, Oxford, 1963. 

[Spi61] M.R. Spiegel. Statistics. McGraw-Hill, New York, second edition, 1961. 

[Ste97] I. Stewart. Does God Play Dice? The New Mathematics of Chaos. Penguin Books, 
London, 1997. 

[SW89] G.A.F. Seber and C.J. Wild. Nonlinear Regression. Wiley Interscience, New York, 
1989. 

[TC93] R. Toral and A. Chakrabarti. Generation of gaussian distributed random numbers by 
using a numerical inversion method. Computer Physics Communications, 74:327-334, 
1993. 

[Tea22] GISTEMP Team-2022. GISS Surface Temperature Analysis (GISTEMP), version 4. 
NASA Goddard Institute for Space Studies; https://data.giss.nasa.gov/gistemp/, 2022. 
[Online; dataset accessed 2022-02-07]. 

[Tuk49] J. Tukey. Comparing Individual Means in the Analysis of Variance. Biometrics, 5:99- 
114, 1949. 

[vD07] J. van Dongen. Emil Rupp, Albert Einstein, and the canal ray experiments on wave- 
particle duality: Scientific fraud and theoretical bias. Historical Studies in the Physical 
and Biological Sciences, 37, Supplement:73—120, 2007. 


Bibliography 629 


[Voi03] J. Voit. The Statistical Mechanics of Financial Markets. Springer-Verlag, Berlin, second 
edition, 2003. 

[W*18] C.J. Werner et al. MCNP Version 6.2 Release Notes. Technical Report LA-UR-18- 
20808, Los Alamos Laboratories, Los Alamos, 2018. 

[Wel47] B.L. Welch. The generalization of “Student’s” problems when several different popu- 
lation variances are involved. Biometrika, 34:28-35, 1947. 

[Wik22] Wikipedia. Sally Clark—Wikipedia, the free encyclopedia. https://en.wikipedia.org/ 
wiki/Sally_Clark, 2022. [Online; accessed 2022-07-02]. 

[ZeaPDG20] P.A. Zyla et al. (Particle Data Group). Review of Particle Physics. Progress in 

Theoretical and Experimental Physics, 2020:083C01, 2020. 


Index 


A 
Accuracy, 530 
classes of, 528 
table of, 528, 553 
Algebra-σ, 13
Algorithm 
Box and Muller, 344 
discrete generation, 325 
Gaussian generation, 345 
inverse transformation, 328 
linear search, 335 
Metropolis-Hastings, 394 
optimized rejection, 338, 341 
Poissonian generation, 347 
simple rejection, 337, 341 
weighted rejection, 340, 342 
Analysis of variance (ANOVA), 262, 512 
interaction, 310 
one way, 299 
two-way, 309 
Arrangements, 24 
Astrology, 120 
Average 
of distributions, 62 


B 
Bandwagon effect, 575 
Batch means, method of, 381, 396 
Bayesian 
approach, 12 
Bernoulli, J., 6 
Best fit, 413 


Best-fit curve, 478 
Bias, 226, 248, 421 
Bin, 46 
Binomial or Bernoulli 
distribution, 51, 75, 117 
Bivariate 
Gaussian distribution, 139 
Boltzmann, H. 
constant, 107 
distribution, 107 
equation of, 370 
Bootstrap, 353 
bias, 360, 363 
non parametric, 359 
parametric, 353, 547, 549 
Bound 
of Cramér-Rao, 433 
Buffon, count de, 319 


C
Cauchy-Schwarz 
inequality, 134 
theorem, 134 
Causality, 135 
Cause-effect, 136 
Centre of mass, 373 
Chaos, 2 
Chaotic, phenomenon, 2 
Cochran’s theorem, 154 
Coefficient 
of determination, 512 
R? corrected, 513 



of variation, 188 
Combinations, 24 
Composite hypotheses, 463 
Concentration ellipses, 155 
Conditional density, 127 
Conditional mean, 129 
Confidence 

band of, 201 

interval of, 201, 203 
Contingency table, 285, 288 
Control batch, 461 
Control chart, 238 
Control zone, 238 
Convergence 

almost sure, 70 

in distribution, 70 

in probability, 70 

strong, 70 

weak, 70 
Convolution, 171, 566 
Correction 

of Bonferroni, 292, 295 

continuity, 219 

of Sheppard, 532, 557 

of Sidak, 292 
Correlation, 134, 479 

coefficients, 135, 150, 192, 248, 255, 361, 480

definition, 135 

function of, 480 

linear, 485 

search for, 511 
Covariance, 127, 131, 134, 187, 246 

between LS parameters, 488 

sample, 248, 249 
Coverage, 203, 356 
Critical region, 259, 450, 451, 457 
Cumulative 

frequency, 48 

function, 44 
Curtiss, L.F., 559 


D 
Decile, 46 
Deconvolution, 171 
Degrees of freedom, 226, 289, 439 
Density 
conditional Gaussian, 145 
definition, 49, 56 
Gaussian bivariate, 156 
joint, 126 
stable, 176 
standard Gaussian, 615 




Student, 263 
Determination 
coefficient of, 512 
Deviate, 41, 88 
Diffusion 
elastic, 371 
length, 377 
of neutrons, 370 
of particles, 370 
Direction cosines, 373 
Distortion factor, 226 
Distribution, 41 
binomial or Bernoulli, 10, 243, 416 
bivariate Gaussian, 344 
Boltzmann, 107 
of Cauchy, 602 
of the correlation coefficient, 250 
gamma, 99, 168, 596 
Gaussian, 81 
Gaussian bivariate, 156 
geometric, 78, 96 
half-normal, 366 
hypergeometric, 593 
logarithmic, 596, 612 
log-normal, 594 
Maxwell, 106 
Monte Carlo simulation of a, 348 
multinomial, 159, 196, 243 
multivariate Gaussian, 152 
N(μ, σ²), 89
non-central F, 308
non-central χ², 308
Poisson, 79, 462 
Poissonian or of Poisson, 559 
Rayleigh, 157 
sample, 262 
of Snedecor, 182, 298, 506, 617 
of Student, 179, 230, 616 
uniform, 108 
of von Mises, 412 
Weibull, 98 
χ², 101, 227, 616


E 
Effect 

butterfly, 4 

placebo, 573 

systematic, 525 
Efficiency 

of generation, 337 

of simulation, 403 
Election results, 215 
Electrical resistance, 236 




Ellipses of concentration, 142 
Energy of molecules, 107 
Equations 
of Boltzmann, 370 
of Clopper-Pearson, 209 
detailed balance, 393 
transport, 370 
Error 
bars, 278, 477, 563 
maximum, 552 
offset, 533, 534 
parallax, 530 
percentage, 546 
plug-in, 199, 214, 219 
on the polarization, 578 
propagation of, 542 
readjusting, 516 
rescaling, 516 
significant digits, 208 
simulation of, 547, 551 
statistical, 199, 530, 551 
systematic, 529 
type I, 260, 294, 450, 466, 573 
type II, 260, 280, 294, 452, 462, 466 
zero-setting, 533 
Estimate, 68 
Estimation 
interval, 434 
Estimator, 69, 71, 200 
asymptotically correct, 226 
BAN, 433 
biased, 226 
consistent, 421 
inadmissible, 428 
of the mean, 71 
most efficient, 422 
unbiased, 72, 421 
of variance, 71 
Events 
definition, 8 
incompatible, 19 
independent, 19 
Expected 
value, 27, 76, 214, 245, 323 
Experiment 
of the coins, 46, 47, 54, 66, 77, 322, 443 
coin tossing, 86 
definition, 8 
of Millikan, 569 
ten coins, 282 
Experimental 
value, 199 
Extrapolation, 511 
Extrasensory perception, 265 


F 
False 
alarm, 238 
negative, 260 
positive, 260 
False discovery rate (FDR), 294 
Falsifiability, 261 
Falsification, 573 
Family-wise error rate (FWER), 294 
Feigenbaum, M., 3 
Fermat, P., 5 
Fermi, E., 320 
Feynman, R., 572 
File R 
warpbreaks, 310 
Fisher, R.A., 415 
transformation of, 252 
Fluctuation 
statistical, 54 
Formula 
of Clopper-Pearson, 355 
of Sidak, 292 
of Stirling, 81 
of Wald, 214, 355 
weighted average, 446 
of Welch-Satterthwaite, 270 
of Wilson, 213, 355 
Frequency, 47, 209 
analysis of, 290 
of arrival, 97 
in a bin, 161 
of decay, 560 
of a disease, 287 
of emission, 94 
experimental, 58 
limit of, 12 
measured, 214, 224, 282 
observed, 417 
test of, 285 
Frequentist 
approach, 12, 201 
Function 
apparatus, 565 
cumulative, 89, 97 
error, 89 
generating (Mgf), 587 
likelihood, 415 
of a random variable, 43 
test, 459 


G 
Galilei, G., 523 
GEANT, 370 


H
Hard science, 524
Height, 255
Heteroskedasticity, 505
Histogram, 46
best fit to, 439
bin of, 46, 243
channel, 48
of cumulative frequencies, 48
of frequencies, 47
normalized, 47
Homeopathy, 573
Hypergeometric law, 25
Hypothesis
alternative, 450
composite, 463
null, 117, 259, 450
simple, 463
test, 9


I
Iid variables, 67
Independence
and correlation, 135
of events, 19
of Gaussian variables, 141
stochastic, 43
of variables, 43, 128
of variables theorem, 127
Independent
experiments, 22
variables, 128
Inequality
Cauchy-Schwarz, 428
of Tchebychev, 233
Information, 426
Instrument
analog, 527
digital, 526
sensitivity, 526
Integral
convolution, 171, 566
with crude MC, 402
folding, 566
with hit or miss method, 401
multidimensional, 410
Interpolation
linear, 335
Interval
of confidence, 351
of probability, 115
sensitivity, 526
Ising, model of, 397
Isotropy, 331


K
Kolmogorov
inequality of, 71, 232
probability, 13, 15


L
Laplace, P.S., 5, 475
Law
logistics, 3
negative exponential, 95
of propagation of errors, 542
strong of large numbers, 70
3-sigma (3σ), 87, 113, 157, 278, 551, 571
weak of large numbers, 70
Least squares
method of, 476
weighted, 495
Legendre, A., 475
Level
of confidence, 201
of probability, 88, 155
of significance, 117, 201, 260, 261, 265, 452
of the test, 260, 452
Likelihood
function, 415
logarithm of, 415
ratio, 457, 463
Limit
frequentist, 11, 69
poissonian, 219
in probability, 70
Line
regression, 145, 481, 489
Logistic map, 3
Lower
limit (estimation of), 204


M
Macroscopic cross section, 370
Marginal density, 127
Markov chain, 393
Matrix
correlation, 150, 194
covariance, 150, 189, 190, 194
curvature, 519
positive definite, 151
of second derivatives, 518
transport, 189, 191, 195, 536 
weight, 496 
Maximum likelihood, 415 
MCNP, 370 
Mean 
definition, 57, 60 
estimation of, 222, 229 
estimator of, 69, 71 
of histograms, 59 
as operator, 64 
sample, 58, 68, 236 
sample (finite population), 240 
weighted, 446, 555 
Measure 
of the charge, 569 
CP violation, 577 
error of, 530 
indirect, 542 
of Michel’s parameter, 576 
precision of, 529 
of radioactive decay, 559 
Measurements 
definition, 8 
error, 554 
of the falling time, 548 
of light velocity, 555 
operation of, 524 
type of, 551 
Media 
as operator, 64 
Method 
best fit, 413 
gradient, 518 
grid, 352 
hit or miss, 401 
least squares, 438, 476 
maximum deviation, 410 
maximum likelihood, 416 
Monte Carlo, 319 
Millikan, R., 569 
Moments, 61, 226 
Monte Carlo methods, 6, 110, 319 
Multiple correlation 
coefficient of, 512 
Multivariate 
Gaussian distribution, 160 


N 

Negative exponential
distribution, 95
Neyman, J., 201
Nonparametric tests, 285
Normalization, 48


O 

Odds ratio, 258, 367, 598 
Operational research, 377 
Operator projection, 154 
Orthogonalization, 495 
Over-coverage, 203 


P 
Paired samples, 273 
Pearson, K. 
theorem, 104, 153, 279 
Percentile, 46, 182, 244 
Phenomena 
stochastic, 94 
waiting, 377 
Pivotal quantity, 206, 214, 230, 231 
Poincaré, H., 2 
Poisson 
limit, 219 
process, 94 
Polarization 
measurement of, 578 
Popper, K., 261, 573
Population 
definition, 8 
Postulate of objectivity, 572 
Power 
function, 463 
of the test, 293, 308, 450 
Pranotherapy, 573 
Precision, 530 
Predictor, 476 
observed, 477 
unobserved, 481 
Probability 
axiomatic, 13 
Bayesian, 11 
compound, 17 
conditioned, 18 
definition, 8 
estimation of, 209 
frequentist, 11 
interval, 115 
limit in, 70 
non-epistemic, 8 
a priori, 11 
space, 15 
subjective, 11 
total, 26 
Problem 
of the encounter, 36, 366 
Monty’s, 36, 366 
Process
collision, 373
stochastic, 372 
Property 
valid in strong sense, 294 
valid in weak sense, 294 
Pull quantity, 534 
p-value, 261, 280 


Q 

Q-Q plot, 63, 593
Quantile, 45, 298, 616
of Welch, 270


R 
Random 
system or process, 2 
variable, 40 
walk, 122 
Refractive index, 353 
Regression 
curve of, 478 
linear, 145 
multiple, 493 
Regression (or best-fit) curve, 503 
Research 
sequential, 325 
Residuals, 487, 503 
weighted, 504 
Routine 
Batchmeans, 400
BayesBobAIl, 33
BinPoisTest, 80
BootCor, 361
Bootgrid, 354
BootPermTest, 364
Buffon, 603
Chi2Testm, 275
Coinfit, 444
Combn, 25
ConvFun, 172, 173, 177
CorrelEst, 138, 250, 598
CovarHisto, 132
CovarTest, 250, 363
Dispn, 25 
FitLike, 439 
FitLineBoth, 483, 609, 610 
FitMat, 539, 541 
FitPlolin, 508
FitPolin, 508, 510, 514, 520, 610
Gauss2, 345
Gaussfit, 442
GdiffMean, 265
GdiffProp, 264, 267, 288
HistoBar, 49, 242, 246
HistoBar3D, 146, 147
Linemg, 482, 520
Linfit, 477, 493, 498, 508, 514
LogiPlot, 4
Logist, 4 
MCasimm, 603 
MCasinc, 386, 388 
MCbinocov, 355
MCcoin, 321 
MCdelta, 603 
MCdetector, 605 
MCdices, 326 
MCDiffProp, 365 
MCellipse, 604 
MCgauss, 345 
MCgauss1, 344, 602 
MCegrid, 353
MCinteg, 403, 408, 411
Mcinteg1, 604
MCintopt, 409, 604 
MCising, 399 
MCKolmoDist, 390 
MCmetrop, 396, 604 
MCneutrons, 374 
MCpoiss, 347 
MCpoisscov, 358 
MCrefrac, 349 
MCsinc, 382 
MCsphere, 333 
MCsystemp, 535 
MCsystems, 534 
MCtTukey, 603 
MCvmises, 605 
MCxsinx, 342 
MeanEst, 232, 448 
MeanHisto, 63, 224
for Metropolis algorithm, 395
MultiTest, 296
Nlinfit, 442, 477, 518, 519, 562
Perm, 25
Poiss.App, 221 
PoissApp, 219, 258 
Sequen, 469 
Sigdel, 545
StaSys, 545
Stimvs, 235
for synchronous simulation, 383, 392
synchronous simulation, 386
TdiffMean, 271, 601
TpTest, 272
VarEst, 228, 232
VarHisto, 63, 224
Routine R, 120
%*% multiplication, 196
abline, 63
aov, 303, 304, 313, 314
bartlett.test, 304, 313
binom.test, 211, 212, 217, 356
bkde2D, 586
cdft, 616
chisq.test, 288, 291
choose, 24
combn, 25
cor, 151
cov, 138, 151
covar, 249
dbinom, 52
dchisq, 105
density, 164, 272, 275, 585
dgamma, 100
dhyper, 25
dnorm, 89
f, 183
factorial, 24
findInterval, 326
hist, 49, 242
integrate, 172, 604
ks.test, 390, 392
leveneTest, 304
lm, 493
max, 326
mean, 62, 224
mle, 439
optim, 439, 539
p.adjust, 293
pchisq, 105, 280, 291
persp, 586
plot, 164, 172, 356
pnorm, 116, 119, 263, 296, 615
poisson.test, 219-221, 258
power.anova.test, 308
power.prop.test, 455
power.t.test, 454
prop.test, 215, 216, 218
ptukey, 317
qchisq, 616, 617
qf, 617
qnorm, 91
qtukey, 306
quantile, 46, 360
qunif, 64
rayleigh, 157
rbind, 288, 291
rbinom, 322, 356
rchisq, 105
replicate, 272
rexp, 386
rgamma, 112
rnorm, 272, 291, 296, 412
rpois, 275, 347
runif, 63, 112, 324, 329
sample, 359
summary, 303, 313
system.time, 326
t, 179
t.test, 274
TukeyHSD, 306, 313
twot.permutation, 364
unif, 138
var, 62, 138, 224
Vectorize, 172
weighted.mean, 448
which, 322, 326


S
Sample
analysis of the, 242
definition, 8, 67
parameters, 58
value, 58
Sampling
definition, 8
by importance, 405 
methods of, 216 
with replacement, 239 
stratified, 405 
Scientific method, 572 
Score function, 426 
Search 
dichotomic, 325 
Sensitivity, 28, 526 
interval of, 526 
Simulation 
asynchronous, 378, 385 
discrete, 378 
synchronous, 378 
Space 
sample, 8 
Specificity, 28 
Spectrum 
continuous, 44 
definition, 8, 44 
discrete, 44 
Standard deviation 
definition, 57, 60 
estimate of, 228 
Standard Gaussian density, 87
Statistic
definition, 68
jointly sufficient, 422
sufficient, 422
Statistical correlation, 137 
Statistical fluctuation, 54, 115, 244, 262, 481 
Statistical inference, 9 
Statistical uncertainty, 199 
Stochastic
chance or phenomenon, 1
sums, 377
system or process, 2
Straight line
of least squares, 485
Student
t-test, 270
Sum of squares, 59
total, 512
Support, 43
Systematic
effect, 525
error, 525, 528


T 
Tchebychev inequality, 113 
Test 
BH, 294 
diagnostic, 28 
difference, 273, 561 
double-blind, 573 
F, 298
of a hypothesis, 259 
Kolmogorov-Smirnov, 388 
of more hypotheses, 450 
most powerful, 459 
of Neyman-Pearson, 457 
one-tailed, 118, 260 
permutation, 364 
power of, 293, 450, 457 
randomized, 262, 461 
of the resistances, 236 
of rubber belt, 290 
sequential, 468 
t, 262, 269 
of Tukey, 603 
two-tailed, 118, 260, 501 
uniformly most powerful, 463 
vaccine, 287 
of variance, 262 
of Welch, 270 
χ², 143, 161, 262, 274, 279, 280, 284, 285, 289, 441
Z, 263
Theorem 
about cumulative variables, 109 
of additivity, 16, 295 
additivity of the chi-square variable, 105 
asymptotic normality, 433 
Bayes, 28, 31 
Benjamini-Hochberg, 296 
on binomial and Poissonian variables, 244 
central limit, 92, 176 
on the change of variable, 168 
Cochran, 154, 230, 506 
of the compound probabilities, 18 
on correct estimates, 499 
Cramér-Rao, 427 
on cumulative variables, 389 
de Moivre-Laplace, 84 
factorization, 422 
of Fisher’s z, 250 
on function of parameters, 424 
Gauss-Markov, 499
Glivenko-Cantelli, 389 
independence between mean and variance, 229
on the independence of variables, 128, 141 
on least squares estimates, 501 
Leibnitz, 165 
mean value, 242 
on the Metropolis sample, 395 
most efficient estimator, 430 
Neyman-Pearson, 457 
partition, 26 
Pearson, 274 
of the Pearson sum, 160 
on the p-value p.d.f., 261 
on quadratic forms, 153 
on sample variance, 230 
of stochastic independence, 101 
on sufficient statistics, 423 
Thermal neutrons 
absorption of, 371 
elastic scattering of, 371 
Thoracic perimeter, 490 
Time 
of arrival, 95 
dead, 95, 560 
falling, 548 
Trial 
definition, 8 
repeated, 19 
Triangular 
distribution, 175, 528 
True 
parameters, 58 
value, 58
Tukey
quantile, 367 
test, 305 


U 
Ulam, S., 320 
Uncertainty 
statistical, 525 
systematic, 525 
Uniform 
distribution, 552 
Upper 
limit (estimation of), 204 


V
Value
p, 261, 599
expected, 54, 65 

Variables 
&, 321 
constrained, 153 
continuous, 44, 54 
cumulative, 109, 324 
discrete, 44 
dummy, 42, 66 
F of Snedecor, 506 
iid, 67 
independent, 43, 127 
modified χ², 275
N(μ, σ²), 89
Poissonian, 245
pooled, 264, 288
product, 188
random, 5
reduced χ², 231
Snedecor F, 180
standard, 88, 263, 571
stochastically independent, 43
Student, 178, 269
uniform, 109, 329 
Z of Fisher, 251 
Variance 
analysis of, 298, 299, 512 
definition, 57, 60 
of distributions, 62 
effective, 482 
estimation, 224, 229 
estimator of, 68, 71 
of histograms, 59
of importance sampling, 405
of the mean, 223
of the mean (finite population), 241
percentage or relative, 546
of the product, 188
relative or percentage, 188
sample, 58, 68 
of stratified sampling, 406 
Variate, 41 
Vectors 
orthogonal, 153 
Vertex, determination of, 520 
Von Mises 
distribution of, 412 
probability, 11 
Von Neumann, J., 320, 339 


W
Waist circumference, 255


Y 
Yates 
correction of, 274 




